New Survey Questions and Estimators for Network Clustering with Respondent-Driven Sampling Data
New Survey Questions and Estimators for Network Clustering with Respondent-Driven Sampling Data

Ashton M. Verdery†1, Jacob C. Fisher2, Nalyn Siripong3, Kahina Abdesselam4, Shawn Bauldry5

† Corresponding author; (1) The Pennsylvania State University; (2) Duke University; (3) University of North Carolina – Chapel Hill; (4) University of Ottawa; (5) Purdue University

Main text word count: 7873

Abstract

Respondent-driven sampling (RDS) is a popular method for sampling hard-to-survey populations that leverages social network connections through peer recruitment. While RDS is most frequently applied to estimate the prevalence of infections and risk behaviors of interest to public health, like HIV/AIDS or condom use, it is rarely used to draw inferences about the structural properties of social networks among such populations because it does not typically collect the necessary data. Drawing on recent advances in computer science, we introduce a set of data collection instruments and RDS estimators for network clustering, an important topological property that has been linked to a network's potential for diffusion of information, disease, and health behaviors. We use simulations to explore how these estimators, originally developed for random walk samples of computer networks, perform when applied to RDS samples with characteristics encountered in realistic field settings that depart from random walks. In particular, we explore the effects of multiple seeds, with- vs. without-replacement sampling, branching chains, imperfect response rates, preferential recruitment, and misreporting of ties. We find that clustering coefficient estimators retain desirable properties in RDS samples. This paper takes an important step towards calculating network characteristics using non-traditional sampling methods, and it expands RDS's potential to tell researchers more about hidden populations and the social factors driving disease prevalence.
Acknowledgements: We thank M. Giovanna Merli, Ann Jolly, and Anne DeLessio-Parson for providing information about aspects of the empirical cases we examine. We acknowledge assistance provided by the Population Research Institute, which is supported by an infrastructure grant from the Eunice Kennedy Shriver National Institute of Child Health and Human Development (R24-HD041025), and from a seed grant provided by the Institute for CyberScience at Penn State University. Portions of this research were funded by NCHS grant 1R03SH000056-01 (Verdery PI).

Introduction

Researchers in many fields are interested in populations that cannot be sampled by conventional methods because they are rare, lack a sampling frame, or are unwilling to participate in traditional survey protocols. Such groups, known as hidden populations (Heckathorn 1997), are often marginalized and at high risk of infections like HIV/AIDS. Respondent-Driven Sampling (RDS) is a set of methods for sampling and making inferences about hidden populations that has proliferated throughout the social sciences and public health (Malekinejad et al. 2008; White et al. 2012). RDS uses a without-replacement "link-tracing" approach, similar to snowball sampling, in which each respondent attempts to recruit a limited number of her personal network contacts in the target population until the desired sample size is attained. RDS offers a popular, quick, cost-effective, and anonymous approach for sampling understudied groups like the homeless, drug users, or commercial sex workers, and it claims to provide asymptotically unbiased estimates of the population mean under limited conditions (Volz and Heckathorn 2008; Salganik and Heckathorn 2004). Though concerns exist about RDS's validity (Gile and Handcock 2010; Verdery et al. 2016; Merli, Moody, Smith, et al. 2015; Lu et al. 2013; Lu et al. 2012; Goel and Salganik 2010; Tomas and Gile 2011; McCreesh et al. 2012; Fisher and Merli 2014; Crawford et al.
2015), continued development of estimators, diagnostics, and reporting protocols is increasing its legitimacy (Lu 2013; Verdery et al. 2015; Gile 2011; Gile and Handcock 2011; Gile, Johnston, and Salganik 2015; White et al. 2015; Nesterko and Blitzstein 2015; Yamanis et al. 2013; McCreesh et al. 2013; Crawford 2016).

Most RDS studies focus on prevalence estimation – that is, estimation of the population mean or proportion of a focal attribute like condom use – and avoid making inferences about other relevant estimands. We focus on network structure and, in particular, clustering. The structure of both social and contact networks is a key component of the risk environment for members of hidden populations (Rhodes and Simic 2005), with important implications for disease transmission (Schneider et al. 2012; Morris et al. 2009) and health behaviors (Centola and Macy 2007). Highly clustered risk networks, like sexual contact networks or shared needle networks, can lead to more redundant paths, making disease transmission more likely (Moody 2002) and altering the relationship between concurrency and epidemic potential (Moody and Benton 2016). Clustering can also have benefits. Highly clustered friendship networks lead to normative reinforcement and can increase individuals' likelihood of engaging in and spreading health-promoting behaviors like joining an internet-based health forum (Centola 2010), adopting modern contraceptives (Kohler, Behrman, and Watkins 2001), abstaining from illicit drugs (Silverman et al. 2007), getting tested for HIV (Karim et al. 2008), or avoiding unprotected sex (Lippman et al. 2010). Normative reinforcement through clustering can also drive unhealthy behaviors, such as sexual concurrency (Yamanis et al. 2015). Despite its sociological and epidemiological importance, few studies of hidden populations using RDS have directly examined network structure.
This is by design: because field implementations of RDS require that samples be conducted without replacement and with maximal anonymity, typical RDS samples have limited opportunity to measure network structure beyond recruiter-recruit relationships. Some have proposed using RDS to measure homophily (Wejnert 2010), or the tendency for people with similar attributes to be tied (McPherson, Smith-Lovin, and Cook 2001), but these approaches are flawed (Crawford et al. 2015) and there have been few developments since. Others have fit exponential random graph models to RDS data (Merli, Moody, Smith, et al. 2015; Gile and Handcock 2011), but learning about networks themselves was not the primary purpose of these studies. The ability of RDS studies to estimate network structure is important, however, because without closer attention to network characteristics that influence risk behaviors and sexually transmitted infections, RDS studies will be unable to offer a comprehensive picture of the dynamics driving epidemic transmission or other network diffusion processes.

This paper focuses on the performance of recently developed estimators of network clustering that can be applied to RDS data. Work in computer science has proposed clustering estimators for data obtained via random walk sampling (RWS) (Hardiman and Katzir 2013), an alternative link-tracing sampling design more appropriate for computer networks than human populations. RDS procedures depart from RWS in several important ways that call into question whether such estimators can apply to RDS. We interrogate this question throughout the paper. First we discuss measures of network clustering, then we introduce their estimation in network censuses vs. samples and review how RDS differs from RWS. Throughout, we focus on RDS data collection strategies that could inform clustering estimators, which leads us to introduce two alternative survey question approaches for RDS.
We next use simulation procedures to evaluate whether our proposed survey questions and estimators of network clustering are appropriate for RDS data, focusing on bias, sampling variance, and total error. We then discuss how our proposed survey questions perform in six empirical RDS surveys. Our results indicate that the estimators maintain reasonable properties with RDS data and that the questions have good empirical properties. These findings lead us to suggest that researchers add clustering questions and estimators to RDS protocols to further explore network structure. We conclude by focusing on the potential benefits of clustering estimation with RDS data.

Background

Initial Notation

The following notation guides our discussion throughout the paper. For illustrative purposes, we rely on Figure 1, which shows a) a hypothetical population (i.e., nodes A through I); b) the social network linking its members (solid lines connecting nodes); c) a hypothetical time-ordered RWS link-tracing sample starting from node A (dashed, directed, and numbered lines); and d) a table counting relevant nodal statistics (on the right). Note that item (c) refers to a random walk sample (RWS) rather than a respondent-driven sample (RDS); in an RDS sample, node E would be ineligible to be sampled a second time because RDS is conducted without replacement. Below, we review this and other differences between RWS and RDS that together call into question whether clustering estimators designed for RWS can be applied to RDS samples.

Figure 1. Example network with a hypothetical random walk sampling (RWS) sample and components needed to calculate local and global clustering coefficients for the whole network.

We characterize a social network of people as a graph with nodes representing people and undirected edges representing social ties.
In Figure 1, we label nodes A through I and represent edges as undirected solid lines (we discuss the time-ordered, directed random walk steps shown with dashed and numbered lines below). We represent the graph as an adjacency matrix, $\mathbf{A}$, whose elements, $A_{ij}$, are 1 if there is a tie (edge) from person $i$ to person $j$ and are 0 otherwise. For instance, there is an edge between nodes B and C in Figure 1 (but not between nodes A and B). We follow standard practices in the RWS and RDS literatures (Lovász 1993; Hardiman and Katzir 2013; Volz and Heckathorn 2008) and consider an undirected graph with one component (see Lu et al. 2013 for the performance of RDS in directed networks). Since the network is undirected, the adjacency matrix is symmetric and $A_{ij} = A_{ji}$ for all $i$ and $j$. We set the diagonal of $\mathbf{A}$ to 0 (i.e., $A_{ii} = 0$ for all $i$). For convenience, we define $d_i$ as the degree of person $i$, meaning how many ties $i$ has in the network. In Figure 1, node A's degree is 1 because he or she is only linked to one other node (E), while node B's degree is 2 because he or she is linked to both E and C. In empirical RDS studies, researchers typically estimate degree by asking respondents questions like "how many people do you know (you know their name and they know yours) who have exchanged sex for money in the past six months?" (WHO 2013, 147). Some have studied the effect of inaccurate degree reporting on RDS estimates (Neely 2009; Lu 2013; Lu et al. 2012), but we assume accurate degree reporting.

Clustering coefficients

Watts and Strogatz (1998) introduced the clustering coefficient to characterize small world networks (Milgram 1967). Small world networks a) are highly clustered, meaning most ties between people appear in pockets of interconnection (see below), and b) have short average path lengths, meaning that the minimum number of steps between network members is, on average, low (e.g., as embodied in the famous phrase "six degrees of separation").
Clustering coefficients measure the first criterion. Watts and Strogatz originally proposed a global measure of the clustering coefficient, defined as

$$C_{GCC} = \frac{\sum_{i \neq j \neq k} A_{ij} A_{jk} A_{ki}}{\sum_{i \neq j \neq k} A_{ij} A_{jk}} \quad (1)$$

where $i$, $j$, and $k$ index unique respondents (Hardiman and Katzir 2013; Newman, Strogatz, and Watts 2001; Watts and Strogatz 1998). The global clustering coefficient (GCC) summarizes the overall network clustering by dividing the count of triangles by the count of connected triplets, where triangles are defined as sets of three individuals ($i$, $j$, and $k$) for whom cells $A_{ij}$, $A_{jk}$, and $A_{ki}$ in the adjacency matrix are all equal to 1, and connected triplets are defined as sets of three individuals ($i$, $j$, and $k$) where cells $A_{ij}$ and $A_{jk}$ are equal to 1. As such, triangles are a subset of connected triplets, ones with a connection in cell $A_{ki}$. A node's number of connected triplets is a function of his or her degree, i.e., node $i$'s number of connected triplets is $\binom{d_i}{2} = d_i(d_i - 1)/2$. Figure 1's embedded table holds triangle and connected triplet counts for each node, from which the GCC of this graph can be computed. It is important to note that equation (1) cannot be evaluated for most RDS studies without information on connections between unsampled peers. We introduce simple questions for RDS surveys that address this issue below.

Extensions to the clustering coefficient concept consider the average amount of clustering among each individual's affiliates in the network. This second measure, the local clustering coefficient (LCC), is defined as

$$C_{LCC} = \frac{1}{n} \sum_{i} \frac{\sum_{j \neq k} A_{ij} A_{jk} A_{ki}}{d_i (d_i - 1)} \quad (2)$$

The LCC is the average of each individual's number of triangles divided by his or her number of connected triplets. In Figure 1, the LCC is obtained by first dividing each node's triangles by its connected triplets, then taking the average (when $d_i \leq 1$, the value is set to 0). Thus, nodes A-C each contribute values of 0 to the LCC, while nodes D and E contribute their respective triangle-to-triplet ratios, and so on. This graph's LCC is 0.5767. As with the GCC, the LCC cannot readily be evaluated for many RDS samples.
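To make the two definitions concrete, the following sketch (our illustration, using a small hypothetical graph rather than the Figure 1 network) computes the GCC and LCC directly from an adjacency structure, following equations (1) and (2):

```python
from itertools import combinations

# Adjacency sets for a small undirected example graph (hypothetical; not Figure 1).
adj = {
    "A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"},
    "D": {"C", "E"}, "E": {"D"},
}

def triangles(i):
    """Triangles containing node i: connected pairs among i's neighbors."""
    return sum(1 for j, k in combinations(adj[i], 2) if k in adj[j])

def triplets(i):
    """Connected triplets centered on node i: d_i choose 2."""
    d = len(adj[i])
    return d * (d - 1) // 2

# Equation (1): total triangles over total connected triplets. Summing the
# per-node counts counts each triangle at all three of its corners, which
# matches the factor-of-three convention in the standard transitivity formula.
gcc = sum(triangles(i) for i in adj) / sum(triplets(i) for i in adj)

# Equation (2): average of per-node ratios, with 0 for nodes of degree < 2.
lcc = sum(
    triangles(i) / triplets(i) if triplets(i) > 0 else 0.0 for i in adj
) / len(adj)

print(gcc, lcc)  # 0.5 and 7/15 for this toy graph
```

The two coefficients disagree here (0.5 vs. roughly 0.467) because the GCC pools triangle and triplet counts before dividing, while the LCC averages each node's own ratio.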
The key difference between the clustering coefficient measures is that the GCC captures the totality of network members' experience, which may be dominated by low clustering among high degree nodes, for instance, while the LCC captures the average experience of network members, where each person in the network is weighted equally. Although clustering coefficients are recent additions to the social networks literature, they resemble other important network characteristics, in particular transitivity, ego-network density, and measures of clustering from the exponential random graph modeling framework. We omit detailed discussion of these alternate measures for the sake of brevity.

Measuring Clustering in Network Censuses and Samples

The calculation of many network-level statistics, including the clustering coefficient, assumes that researchers measure the entire adjacency matrix, $\mathbf{A}$, in terms of cells (edges) and rows/columns (nodes). In Figure 1, it would be assumed that the researcher measured all ties (solid, undirected lines) and nodes (labeled A-I). Collecting such saturated network data is challenging (Smith 2012), however, and often impossible for populations without clearly defined institutional boundaries (e.g., schools). In other settings, either intentionally or not, researchers do not collect data on all network members (node missingness), do not measure all relevant ties linking network members (edge missingness), or both. When researchers cannot conduct a census of the network, they often turn to samples. There are many approaches to collecting sampled network data, including randomly drawn samples (Marsden 1987; Krivitsky, Handcock, and Morris 2011; Smith 2012; McPherson, Smith-Lovin, and Brashears 2006) and numerous link-tracing approaches (Goodman 1961; Heckathorn 1997; Volz and Heckathorn 2008; Mouw and Verdery 2012). We focus on the latter.
Hardiman and Katzir Estimators

Hardiman and Katzir (2013) introduce estimators for the LCC and GCC that use data gathered in an RWS sample, like that shown in Figure 1. Intuitively, for vertices sampled via RWS, they estimate clustering from the presence of a tie between the vertices sampled before and after the focal vertex. Typical RDS studies do not ask about the existence of this tie, though some have (see application section below and Appendix B), and in the next section we propose two question formats for RDS studies to assess its existence. More formally, for step $k$ in a random walk, let $\phi_k$ represent whether a tie is present between the vertex before $x_k$, i.e., $x_{k-1}$, and the vertex after $x_k$, i.e., $x_{k+1}$. In the random walk depicted in Figure 1, for instance, $\phi_k$ would be 0 the first time node E is sampled because nodes A and H are unconnected, but it would be 1 the second time node E is sampled because nodes F and I are connected. That is, $\phi_k = A_{x_{k-1} x_{k+1}}$ for each $k$, where $A_{ij}$ is the cell in the $i$th row and $j$th column of the adjacency matrix, as before. Importantly, $\phi_k$ is not calculated for the first and last nodes of the walk, because the former has no recruiter and the latter no recruit.

Next, for the LCC, define a weighted sum of the $\phi_k$ values as $\Phi = \frac{1}{r-2} \sum_{k=2}^{r-1} \frac{\phi_k}{d_{x_k} - 1}$. In this case, $d_{x_k}$ represents the degree of the $k$th vertex in the random walk and $r$ is the length of the random walk. Thus, $\Phi$ is the average of whether the previous vertex in the random walk ($x_{k-1}$) and the following vertex in the random walk ($x_{k+1}$) were tied, weighted approximately inversely to the current vertex's degree. This weighting corrects for unequal observation probabilities: in RWS on an undirected, unweighted graph, the probability of observing a given vertex is proportional to that vertex's degree if the random walk is in the steady state, which is typically achieved if the walk is sufficiently long or started with steady state probabilities (reviewed in greater depth in Verdery et al. 2016; Lovász 1993).
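The bookkeeping required — the tie indicators $\phi_k$, the sampled degrees, and the ratio estimators formed from them — can be sketched in a few lines of Python. This is our illustration on a toy graph and recorded walk (both hypothetical), not the authors' code:

```python
# Toy population graph and a hypothetical recorded random walk over it.
adj = {
    "A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"},
    "D": {"C", "E"}, "E": {"D"},
}
deg = {v: len(nbrs) for v, nbrs in adj.items()}

walk = ["A", "B", "C", "D", "C", "A"]  # visited nodes, in order
r = len(walk)

# phi_k: is the vertex before step k tied to the vertex after step k?
# Defined only for interior steps (neither the first nor the last node).
phi = [1 if walk[k + 1] in adj[walk[k - 1]] else 0 for k in range(1, r - 1)]
interior = walk[1:-1]

# Local clustering estimator: inverse-degree-style weights on phi_k,
# normalized by the mean reciprocal degree of sampled vertices.
Phi_l = sum(p / (deg[v] - 1) for p, v in zip(phi, interior) if deg[v] > 1) / (r - 2)
Psi_l = sum(1 / deg[v] for v in walk) / r
lcc_hat = Phi_l / Psi_l

# Global clustering estimator: degree-weighted phi_k over (degree - 1) terms.
Phi_g = sum(p * deg[v] for p, v in zip(phi, interior)) / (r - 2)
Psi_g = sum(deg[v] - 1 for v in interior) / (r - 2)
gcc_hat = Phi_g / Psi_g
```

In practice the walk would be far longer than this toy example; the weights only approximate the steady-state inclusion probabilities discussed above, and a short walk yields noisy ratios.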
We note that this finding cannot be assumed to hold for the finite, branching, without-replacement samples conducted in RDS, and future research may investigate alternate weighting schemes. Finally, let $\Psi = \frac{1}{r} \sum_{k=1}^{r} \frac{1}{d_{x_k}}$, representing the average of sampled vertices' reciprocal degrees. Hardiman and Katzir define an estimator of the LCC as:

$$\hat{C}_{LCC} = \frac{\Phi}{\Psi} \quad (3)$$

Hardiman and Katzir also develop an estimator of the GCC. Letting $\Phi_g = \frac{1}{r-2} \sum_{k=2}^{r-1} \phi_k \, d_{x_k}$ and $\Psi_g = \frac{1}{r-2} \sum_{k=2}^{r-1} (d_{x_k} - 1)$, they suggest the following measure for the global clustering coefficient:

$$\hat{C}_{GCC} = \frac{\Phi_g}{\Psi_g} \quad (4)$$

Hardiman and Katzir use both analytic proofs and simulation to show that their proposed estimators are asymptotically unbiased with minimal variance for large RWS samples and that they produce more consistent results at any given sample size than other approaches that query each sampled node's full ego network (counting ego network reports in the sample size). Although RDS does not rely on simple random walks, researchers may wish to apply these estimators to RDS samples. The following section discusses RDS departures from RWS with special attention to the empirical contexts in which RDS studies are conducted. Within it, we propose new survey questions that researchers could employ to estimate clustering via the Hardiman and Katzir estimators. We examine how these questions perform in six empirical surveys in the discussion section.

RDS Departures from RWS

The Hardiman and Katzir estimators cannot immediately be applied to RDS studies in the field because they were developed for RWS, which differs considerably in core assumptions. Deviations of RDS from RWS have been shown in prior work to bias other estimators, like that of the population mean (Gile 2011; Merli, Moody, Smith, et al. 2015; Tomas and Gile 2011) and sampling variance (Verdery et al. 2016), so we should not expect that a naïve application of Hardiman and Katzir's clustering coefficient estimators will yield viable estimates from empirical RDS samples.

Table 1. Comparison of features of RWS and RDS.
Feature                       RWS                            RDS
1) Number of seeds            One                            Multiple
2) Seed selection             Proportional to steady state   Convenience
3) Branching                  No                             Yes
4) Replacement                Yes                            No
5) Link tracing efficacy      100%                           Less than 100%
6) Preferential recruitment   No, researcher controls        Yes, respondent controls
7) Sample size                Large (more than 10,000)       Small (less than 1,000)
8) Measurement of $\phi_k$    Can be queried                 Asked of respondent

Table 1 summarizes 8 RDS departures from RWS that may affect clustering estimation. An RWS sample of a network begins by selecting a single "seed" node, typically with probability proportionate to the steady state probability, $\pi_i = d_i / 2m$, where $d_i$ is the degree of node $i$ in the population and $m$ is the number of edges in the population (Lovász 1993). By contrast, most RDS protocols recommend initiating the sample by identifying, often by convenience, eight to ten members of the hidden population who are willing to participate, have large personal networks with other members of the target population, and are diverse with respect to relevant focal attributes, such as years injecting drugs (WHO 2013, 71-82). A first consequence of this distinction is that RWS samples lead to a single chain in a network (as in the hypothetical chain depicted in Figure 1), whereas RDS samples start from multiple points and yield multiple chains. A second consequence is that RDS samples often exhibit seed dependence, whereas RWS samples do not (Gile and Handcock 2010). RWS and RDS also differ in their approach to tracing links. RWS samples proceed without branching (i.e., one coupon), while RDS samples almost always allow branching in practice through the distribution of two or three recruiting coupons to each respondent (Goel and Salganik 2009). RWS samples are conducted with replacement while RDS is conducted without replacement, which means that recruitment becomes competitive (Heckathorn 1997; Barash et al. 2016; Gile and Handcock 2010; Gile 2011; Crawford 2016).
Other differences arise because RWS is researcher-driven (or algorithm-driven), while RDS is respondent-driven. In RDS, respondents must identify, approach, and successfully recruit peers, which can yield less than perfect link tracing efficacy and introduce preferential recruitment (Merli, Moody, Smith, et al. 2015; Verdery et al. 2015). Sample size is another distinction, because RWS samples are used in computer science or fields where the cost of sampling additional individuals is low compared to RDS in human populations (Mouw and Verdery 2012). For instance, Hardiman and Katzir examine their estimators' performance in four large networks with 1% samples of each. By contrast, Malekinejad et al. (2008) report attained sample sizes for 63 RDS studies that are orders of magnitude smaller. A first consequence of smaller samples is that RDS samples are more likely to contain finite sampling bias even when assumptions are met, because the samples are too small for asymptotically unbiased RDS estimators to minimize bias. A second consequence of small RDS samples is that they are likely to violate the RDS assumption that the sample is "in equilibrium," a fact exacerbated by convenience sampling of seeds (Gile and Handcock 2010; Wejnert 2009).

The final departure of RDS from RWS is anonymity, which pertains to the measurement of $\phi_k$, whether person $x_k$'s recruiter knows person $x_k$'s recruit. Unlike in computer or online networks, where it is comparatively easy to determine for each node in the random walk whether the prior node, $x_{k-1}$, is tied to the subsequent node, $x_{k+1}$, this task is more challenging in an RDS sample of a human population. One cannot look up $x_{k+1}$ in a stored contact list of node $x_{k-1}$ or otherwise backtrack the sample for direct measurement; rather, the existence of this tie must be elicited from respondents themselves while they are answering the survey, which can introduce measurement error and other challenges.
The timing of recruitments and the preservation of anonymity in RDS mean that a) researchers cannot ask about recruitments that have not yet occurred (e.g., cannot ask A whether he or she is tied to H in the RWS in Figure 1), and b) researchers cannot divulge who recruited whom to respondents (e.g., cannot tell H that A recruited E). The middle recruit is the only feasible person to ask about this tie's existence in an RDS sample (E in this example), although this requires E to report on a tie that exists between two of his alters and thus may introduce reporting error (a topic we examine below). In many RDS surveys, a majority of respondents participate twice, once when they are recruited themselves (primary interview) and a second time when they return to the research site to collect additional incentives for successfully recruiting peers (secondary interview). Acknowledging this interview timing, we propose two questions that researchers can ask RDS respondents to feasibly elicit information about potential ties between $x_{k-1}$ and $x_{k+1}$:

A) [In the secondary interview] "Does the person who gave you the coupon know the person who you gave the coupon to, or vice versa?" (We refer to this from here on as the binary question format.)

B) [In the primary or secondary interview] "What percent of people who you know in the population does the person who gave you the coupon know?" (We refer to this from here on as the percentage question format.)1

The binary question format garners the exact information required by the Hardiman and Katzir estimators, but it relies on the accuracy of respondent reports about recruiter-recruit relationships. It also can only be computed for a subset of sampled cases, as it cannot be asked until the secondary interview (after recruitment).
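In practice, many surveys derive the percentage-format response from a pair of network-size questions (personal network size, then the number of those alters known by the recruiter; see the footnote and Appendix B). A small sketch of that derivation, with hypothetical field names and basic cleaning of the kinds of inconsistencies such reports can contain:

```python
# Hypothetical respondent records: personal network size (an A1-style question)
# and number of those alters known by the recruiter (an A2-style question).
responses = [
    {"network_size": 20, "known_by_recruiter": 5},
    {"network_size": 8, "known_by_recruiter": 8},
    {"network_size": 0, "known_by_recruiter": 0},    # no reported alters
    {"network_size": 10, "known_by_recruiter": 12},  # inconsistent report
]

def recruiter_overlap_pct(rec):
    """Share of the respondent's alters known by the recruiter, in [0, 1].

    Returns None when the percentage is undefined (no reported alters);
    inconsistent reports (A2 > A1) are capped at 1.0.
    """
    if rec["network_size"] <= 0:
        return None
    return min(rec["known_by_recruiter"] / rec["network_size"], 1.0)

pcts = [recruiter_overlap_pct(r) for r in responses]
print(pcts)  # [0.25, 1.0, None, 1.0]
```

How undefined and capped responses should enter the estimators is an analytic choice; the paper's simulations (below) cap misreported percentages at 0 and 1 in a similar spirit.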
The percentage question format differs from Hardiman and Katzir's suggested approach, but it can be asked during either the main survey (of all respondents) or the follow-up interview (of the subset of respondents who recruit). If asked in both, researchers can check test-retest validity and potentially diagnose respondent comprehension problems. Of course, there are other possible ways to ask such questions in RDS surveys, but our proposed approaches are flexible in terms of implementation and preserve the desirable confidentiality of standard RDS studies.

1 Note, many studies do not ask respondents directly for the percentage. Rather, they ask them to report personal network size (e.g., "A1. How many adult sex workers do you know who live in this city?"), then to report the number known by the recruiter (e.g., "A2. Of the number in A1, how many are known by the person who gave you the coupon?"). Percentages can be calculated directly from this pair of questions. We review six surveys that asked variants of the questions needed to calculate the clustering coefficient estimators in Appendix B.

Data and Methods

Approach

We first evaluate the performance of Hardiman and Katzir's estimators applied to RDS through simulation methods. We aim to understand the effects of increasingly large departures from RWS, toward more realistic situations encountered within RDS data collection. To do this, we simulate data collection from underlying population social networks. It is notoriously difficult to obtain analytical results for RDS estimators, which is why many prior developments have tested proposed estimators through simulation. We test scenarios driven by data collection parameters to match how RDS departs from RWS, drawing 1,000 samples in each scenario. It is important to draw multiple samples per scenario to determine the estimators' distributional properties (bias, sampling variance, and total error).
For each simulated sample, we calculate the Hardiman and Katzir LCC and GCC estimators implemented with both question formats we proposed. We compare these sample estimates to the parameters in the population social network (i.e., as would be calculated in a census). After examining how Hardiman and Katzir's estimators perform in simulations, we evaluate their feasibility in six empirical RDS samples.

Data

Figure 2. Largest Weakly Connected Component of Project 90 Data Set, Nodes Colored by Race (Grey = White; Black = Non-White) and Sized by Degree. The network is displayed using the ForceAtlas2 algorithm, with no node overlap, in Gephi 0.9.

We first simulate link-tracing samples from a hidden population social network of heterosexuals, sex workers, and injecting drug users at elevated risk of HIV/AIDS, collected beginning in 1987 as part of the Project 90 study in Colorado Springs, CO (Potterat et al. 2004; Woodhouse et al. 1994; Rothenberg et al. 1995; Klovdahl et al. 1994). The project aimed to assess how network structure affected disease transmission, and, as such, the researchers sought to obtain a census of the hidden population and their links to one another. These data have previously been used in prior RDS assessments (Goel and Salganik 2010) and are made available to researchers by the Office of Population Research at Princeton University ("Office of Population Research, Princeton University" 2015). We focus on 4,111 individuals linked by 17,164 ties that remain in the network's largest weakly connected component after dropping cases lacking valid attribute codes. Figure 2 shows the network linking members of this population, with nodes colored by a key structuring variable (white/non-white). Whites make up 74.7% of network members, while 20.6% of ties cross race categories. Nodes of different races group together in different parts of the figure, but there are many cross-group links.

Table 2.
Summary network statistics for data sets analyzed in this paper.

Network            Nodes   Edges     Density   GCC     LCC     Cross group ties**
Project 90         4,111   34,328    0.002     0.657   0.348   0.206
Facebook Nets*
  Minimum          4,985   212,114   0.004     0.200   0.135   0.015
  25th Percentile  5,930   367,486   0.008     0.216   0.152   0.032
  Median           6,877   503,939   0.013     0.231   0.167   0.038
  75th Percentile  7,840   705,501   0.014     0.241   0.179   0.054
  Maximum          9,693   905,428   0.017     0.276   0.199   0.163

Notes: *Statistics presented for the Facebook networks are computed separately; the largest network does not necessarily have the largest proportion of cross-group ties, for instance. **Cross group ties refer to ties that cross white/non-white categories in Project 90 and ties that cross freshmen/non-freshmen categories in the Facebook networks.

To understand how the Hardiman and Katzir estimators perform across a range of networks, we also examine additional networks from a data set of 100 Facebook networks collected in 2005, which have also been subject to intensive examination in prior simulation evaluations of RDS (Mouw and Verdery 2012; Verdery et al. 2016). Importantly, because they were collected when Facebook was new and membership was restricted to those with college email addresses, researchers have argued that these networks represent realistic, offline social and interaction networks (Traud, Mucha, and Porter 2012; Traud et al. 2011; Clouston et al. 2009). We restrict analysis to 29 university networks where the largest connected component of users with valid attribute codes contained between 5,000 and 10,000 nodes, size restrictions we put in place to avoid without-replacement sampling effects (Barash et al. 2016) and to maintain computational tractability. The Project 90 network is smaller, less dense, more clustered, and less homophilous than the Facebook networks.
Scenarios

We provide a replication file for researchers interested in replicating and expanding our scenarios for the Project 90 network, which is publicly available data. In both data sets, we focus on five scenarios designed to test the bias, sampling variance, and error of Hardiman and Katzir's estimators when used with standard RDS protocols as opposed to simple RWS. Table 3 shows which key features we manipulate in each scenario. We first simulate collecting simple random walks ("RWS baseline"). These scenarios begin from a single seed selected with steady state probabilities, are conducted with replacement, do not branch, experience 100% link-tracing efficacy without preferential recruitment, and do not contain any measurement error for $\phi_k$. We then selectively relax parameters until the samples resemble the standard RDS protocol. We start with a scenario designed to mimic an ideal case of RDS constrained by the method's actual implementation in the field ("RDS baseline"). This scenario's samples begin from 10 seeds selected via convenience sampling (implemented as uniform random seed selection in the main text; in Appendix A we consider four other seed selection scenarios and find that they do not alter our results), are conducted without replacement (recruitment is competitive between respondents), and may branch up to three ways from each respondent (i.e., each respondent is simulated as having 3 coupons); respondents always approach and succeed in recruiting peers who have not already been sampled (i.e., 100% recruitment efficacy), selecting them at random among the set of their friends who have not participated (no preferences), and respondents accurately report the items used to measure $\phi_k$ (either the presence or absence of a tie between their recruiter and their recruit for the binary question format, or the percentage of their potential recruits known by their recruiter).
This RDS baseline scenario subsumes the first four ways that RDS departs from RWS listed in Table 1. We next examine the fifth through seventh ways that RDS departs from RWS. We look at how less than perfect recruitment efficacy affects estimates by considering a scenario where only 80% of offered coupons are accepted by the targeted peer ("+ less than 100% efficacy"). We then test the effects of preferential recruitment ("+ preferential recruitment"), modeling it as a case where all respondents are half as likely to offer coupons to certain types of peers (to white peers in the Project 90 network and to freshmen in the Facebook networks). Finally, we examine what happens when respondents misreport recruiter-recruitee ties ("+ measurement error"). For the binary question format, where respondents report on the presence or absence of a tie between their recruiter and recruitee, we subject each report to a 10% random chance of being misattributed (ties reported as non-ties or non-ties reported as ties). For the percent question format, where respondents report on the percent of their network alters known by their recruiter, we randomly shift this number by up to ±10% from its true value (capping responses at 0 or 1).

Table 3. Parameters used in each simulation scenario.

Scenario                              Seeds  Selection     Replace  Branches  Efficacy  Preferential  Error
RWS baseline                          1      Steady state  Yes      1         100%      No            0%
RDS baseline                          10     Convenience   No       3         100%      No            0%
+imperfect (80% efficacy)             10     Convenience   No       3         80%       No            0%
+preferences (targeted recruitment)   10     Convenience   No       3         80%       Yes           0%
+misreporting (mismeasurement)        10     Convenience   No       3         80%       Yes           10%

In all simulated samples we assume respondents accurately report degree. Although sample size marks a key way in which RDS departs from RWS, we hold target sample sizes constant at 400, which is a small fraction of the population sizes we examine.
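The branching, without-replacement recruitment process described above can be sketched in a few lines. This is a simplified illustration under our own assumptions (toy adjacency list, uniform seed selection, respondents who never re-offer a rejected coupon), not the authors' simulation code:

```python
import random
from collections import deque

def simulate_rds(adj, n_seeds=2, coupons=3, efficacy=1.0, target=8, rng=None):
    """Branching RDS recruitment without replacement on an undirected graph."""
    rng = rng or random.Random(0)
    seeds = rng.sample(list(adj), n_seeds)   # convenience seeds (uniform here)
    sampled = set(seeds)
    queue = deque(seeds)                     # respondents currently holding coupons
    while queue and len(sampled) < target:
        ego = queue.popleft()
        candidates = [v for v in adj[ego] if v not in sampled]  # no replacement
        rng.shuffle(candidates)
        for peer in candidates[:coupons]:    # hand out up to `coupons` coupons
            if len(sampled) >= target:
                break
            if rng.random() <= efficacy:     # peer accepts the coupon
                sampled.add(peer)
                queue.append(peer)
    return sampled

# Toy network: two loosely connected cliques.
adj = {
    0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4},
    4: {3, 5, 6, 7}, 5: {4, 6, 7}, 6: {4, 5, 7}, 7: {4, 5, 6},
}
sample = simulate_rds(adj, n_seeds=2, coupons=3, efficacy=0.8, target=6)
```

Setting `coupons=1`, `efficacy=1.0`, and one seed, and allowing already-sampled peers to be revisited, would recover the RWS baseline; the scenarios in Table 3 relax these parameters one at a time.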
We found that target sample sizes were attained in all scenarios, which reviews of RDS indicate happens frequently (Malekinejad et al. 2008).

Measures

We measure the performance of Hardiman and Katzir's clustering coefficient estimators with three indicators. For each of the question formats (binary or percentage) of each of the estimators (GCC or LCC) in each scenario, we calculate a) their bias, defined as (1/S) Σ_s (ĉ_s − c), where S is the number of simulated samples, ĉ_s is the estimate from sample s, and c is the population parameter; b) their sampling variance (SV), defined as (1/S) Σ_s (ĉ_s − c̄)², where c̄ is the mean of the S estimates; and c) their root mean square error (RMSE), defined as √(bias² + SV).

Simulation Results

Figure 3. Performance of Hardiman-Katzir estimators by estimator and question format in RWS on the Project 90 data set.

We first consider the distribution of estimates for both the GCC and LCC calculated via the binary and percent question formats in the baseline RWS scenario on the Project 90 network. Figure 3 shows that both estimators, using either question format, exhibit minimal bias that arises because of finite sample sizes. The LCC estimator is less biased than the GCC estimator. Sampling variance is approximately equivalent across estimators and question formats. Considering both bias and sampling variance simultaneously, we find that the LCC percent estimator performs the best and that the percent question form has slightly lower error. We next examine the distribution of estimates in realistic RDS samples and ask what features of RDS lead to performance deterioration compared to the RWS baseline scenario. Figure 4 shows that in the Project 90 network the GCC estimated using the binary question format performs poorly in each of the RDS scenarios, underestimating the population parameter substantially.
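The three performance measures defined above can be computed from a set of simulated estimates in a few lines; a sketch using illustrative numbers, not values from the study:

```python
from math import sqrt

def performance(estimates, truth):
    """Bias, sampling variance, and RMSE of a set of simulated estimates."""
    s = len(estimates)
    bias = sum(e - truth for e in estimates) / s
    mean = sum(estimates) / s
    sv = sum((e - mean) ** 2 for e in estimates) / s
    rmse = sqrt(bias ** 2 + sv)   # MSE decomposes into bias^2 + variance
    return bias, sv, rmse

# Illustrative estimates of a clustering coefficient whose true value is 0.30.
bias, sv, rmse = performance([0.28, 0.31, 0.27, 0.30], truth=0.30)
```

The decomposition makes the paper's later trade-off explicit: an estimator with slightly higher bias can still win on RMSE if its sampling variance is low enough.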
Underestimation begins with the RDS baseline scenario and persists, which indicates that problems for this estimator arise from the use of multiple seeds, convenience seed selection, the without-replacement design, and/or branching. Because we do not see comparable biases in the percent format under these scenarios, we attribute this bias to the binary question format's restrictions on effective sample size: the binary question format is asked only of non-seed respondents who recruit others, while the percent format can be asked of any non-seed sample participant.

Figure 4. Performance of Hardiman-Katzir estimators by estimator and question format in RWS and RDS scenarios on the Project 90 data set.

The LCC estimators perform well in Figure 4. The binary question format of the LCC slightly overestimates clustering, while the percent form slightly underestimates it. Estimates obtained in all RDS scenarios in the Project 90 network exhibit low sampling variance (ranging from 0.001 to 0.003), substantially lower than was found for the RWS scenarios. This result follows from RDS's without-replacement design, which tends to yield lower sampling variance than RWS's with-replacement design. RMSEs in the worst case scenarios, which contain all RDS deviations from RWS that we examine, are lower than we found for the RWS baseline scenarios in all cases.

Table 4. Distributions of absolute bias statistics and RMSEs in the 29 Facebook networks studied, by scenario, estimator, and question format.
                     Absolute Bias                     RMSE
                     GCC             LCC               GCC             LCC
                     Binary  Percent Binary  Percent   Binary  Percent Binary  Percent
RWS baseline
  Min                0.000   0.000   0.000   0.000     0.019   0.006   0.040   0.023
  25th percentile    0.000   0.000   0.001   0.000     0.022   0.008   0.046   0.028
  Median             0.001   0.000   0.001   0.001     0.023   0.008   0.048   0.030
  75th percentile    0.001   0.000   0.002   0.001     0.024   0.009   0.052   0.032
  Max                0.002   0.000   0.005   0.003     0.027   0.012   0.058   0.040
RDS baseline
  Min                0.012   0.006   0.002   0.000     0.025   0.010   0.051   0.025
  25th percentile    0.019   0.009   0.010   0.004     0.031   0.014   0.055   0.029
  Median             0.020   0.011   0.012   0.007     0.033   0.016   0.057   0.033
  75th percentile    0.026   0.013   0.017   0.009     0.037   0.020   0.062   0.038
  Max                0.041   0.021   0.025   0.015     0.051   0.028   0.074   0.052
RDS misreporting
  Min                0.030   0.002   0.032   0.001     0.043   0.009   0.064   0.025
  25th percentile    0.046   0.006   0.044   0.003     0.055   0.012   0.068   0.028
  Median             0.050   0.008   0.048   0.005     0.057   0.014   0.071   0.031
  75th percentile    0.054   0.010   0.051   0.007     0.061   0.017   0.074   0.037
  Max                0.065   0.016   0.061   0.014     0.070   0.024   0.089   0.055

We next turn to results in the Facebook networks. Table 4 shows how absolute values of bias ("absolute bias") and RMSEs are distributed within these networks by estimator and question format in three focal scenarios (RWS baseline, RDS baseline, and RDS misreporting). We display these scenarios because the +imperfect and +preferences scenarios made little difference in the results. We do not show the low sampling variance we found in all scenarios for the Facebook networks (a maximum of 0.004 across networks in any scenario). The estimators exhibit almost no bias in the RWS baseline scenarios, with a maximum that is substantially lower than was seen in the Project 90 network. The RWS baseline scenario also tends to produce much lower RMSEs in these networks than it did in the Project 90 network. The RDS scenarios also yield lower bias in the Facebook networks than they did in the Project 90 network, with maximum observed values all lower in these networks.
In terms of bias, the Facebook networks indicate that the binary measures are the most biased, with the LCC being less biased than the GCC. The Facebook networks also have lower RMSEs than the Project 90 network. In terms of RMSEs in the realistic RDS scenarios, results from the Facebook networks suggest that the percent question format is preferable to the binary format and that the GCC is slightly preferred over the LCC after accounting for sampling variance (recall that the LCC had lower bias). In total, median RMSEs observed in the RDS scenarios in the Facebook networks are only slightly larger than the median RMSEs obtained in the RWS baseline scenarios, which indicates that the clustering coefficient estimators maintain reasonable properties for application to RDS samples.

Application of Data Collection Instruments in Six Empirical Surveys

Table 5. Summary of Item Response Rates for Clustering Questions in Empirical Surveys.

Survey location        Population  Format      Reports(a)  Invalid %  Mean of valid
Shanghai, China        FSW(b)      Percent     515         0.0%       23.2%
Liuzhou, China         FSW(b)      Percent     576         0.5%       42.3%
Cebu, Philippines      PWID(c)     Binary      380         14.2%      78.7%
Mandaue, Philippines   PWID(c)     Binary      291         8.3%       91.7%
Ottawa, Canada         PWID(c)     Percent(e)  364         11.5%      67.0%
La Plata, Argentina    Veg(d)      Percent     145         5.5%       32.0%
La Plata, Argentina    Veg(d)      Binary      131         36.6%      30.1%

Notes: a-We refer to reports rather than sample size because for the binary questions some respondents report on multiple relationships; b-FSW stands for female sex workers; c-PWID stands for persons who inject drugs; d-Veg stands for self-identifying vegetarians and vegans; e-The format used in the Ottawa study is an interaction grid in which respondents identify which peers know one another; see appendix.

We now discuss six empirical RDS surveys collected in diverse hidden populations in multiple countries by different research teams that asked respondents the types of questions needed to estimate network clustering.
Two studies examine female sex workers in China, two examine people who inject drugs in the Philippines, one looked at people who inject drugs in Canada, and the last survey, which contained both of our proposed question formats, looked at vegetarians and vegans in Argentina. For the sake of brevity, we omit full descriptions of these studies in the main text but provide complete details in Appendix B. We focus on the proportion of invalid item responses ("Invalid %") in each survey across question formats, where we define invalid responses as cases where respondents did not answer the question, gave responses of "don't know", or otherwise offered evidence that they did not understand or did not wish to answer the question. We also compare the mean values of valid responses ("Mean of valid") between relevant survey pairs (comparing the two surveys in China to each other, and the two surveys in the Philippines to each other), and within individuals who answered both types of questions in the survey in Argentina. Table 5 summarizes the item response patterns in these empirical surveys. Respondents were much more likely to give invalid responses to the binary question format than to the percent question format. More speculatively, we can make some claims about conceptual validity by examining the cross-site concordance in the means of valid responses within the two sets of paired surveys. For instance, the means of valid responses in the female sex worker surveys collected by overlapping research teams in two cities in China are moderate (23.2%-42.3%), while the means of valid responses for the two surveys of persons who inject drugs in Philippine cities are much higher (78.7%-91.7%). We take these findings to indicate that the survey questions are measuring consistent phenomena. In addition, we find nearly identical means of valid responses between the two question formats implemented in the Argentina survey.
Here, both the percent and binary measures find raw clustering levels in the 30.1-32.0% range, and we found that the respondent-specific average of binary format vs. percent format reports had a Spearman correlation of 0.445, while the item-specific reports, with potentially multiple binary reports per respondent, had a polyserial correlation of 0.376. These correlations suggest a reasonably high level of agreement between question formats, even in the face of large amounts of missing data. Taken together, these results indicate that the questions tap into valid concepts, but they add another reason that researchers should prioritize implementing the percent question format: respondents seem more willing or able to answer it.

Discussion and Conclusion

Sociological interest in marginalized populations means researchers often confront situations where traditional sampling methods cannot be used. In such situations, RDS's peer-driven recruitment procedures yield large and diverse samples quickly and cheaply while maintaining respondent anonymity, which is why researchers have used it to sample hundreds of stigmatized, sensitive, and hidden groups. Prior methodological research on RDS has focused on its estimators of the population mean and avoided examining other features of hidden populations that it can reveal (with a few notable exceptions: Crawford 2016; Wejnert 2010). This avoidance is strategic: practical considerations limit researchers' ability to uncover most features of the underlying population social network. In this paper, we developed recent work in computer science and proposed new data collection protocols and estimators that allow researchers to examine one network feature of broad interest, clustering. We began by considering estimators of network clustering proposed for random walk sampling (RWS) and expanded their application to the case of RDS, with careful attention to practical differences between RDS and RWS.
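A Spearman correlation like the one reported for the Argentina survey is a Pearson correlation computed on ranks (with ties assigned average ranks). A minimal self-contained sketch; the paired reports below are hypothetical, not data from the survey:

```python
def ranks(xs):
    """Average ranks, 1-based; tied values share the mean of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx)
           * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

# Hypothetical per-respondent clustering reports under the two formats.
binary_avg = [0.0, 0.5, 1.0, 0.5, 1.0, 0.0]
percent    = [0.1, 0.3, 0.9, 0.4, 0.7, 0.2]
rho = spearman(binary_avg, percent)
```

In practice one would use a library routine (e.g., `scipy.stats.spearmanr`), but the hand-rolled version makes explicit that agreement is assessed on orderings, which is appropriate when the binary and percent formats sit on different scales.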
We offered data collection protocols in the form of two different question formats that RDS surveys could adopt in the field to estimate network clustering, and we studied their performance in simulations and their implementation challenges in six empirical surveys. Overall, we recommend that researchers using RDS surveys begin asking respondents the types of questions that would allow for clustering coefficient estimation. While RDS estimators of the population mean often fail in the face of unmet assumptions about sample recruitment (Gile and Handcock 2010; Verdery et al. 2016; Merli, Moody, Smith, et al. 2015; Lu et al. 2013; Lu et al. 2012; Goel and Salganik 2010; Tomas and Gile 2011; McCreesh et al. 2012), we find that clustering coefficient estimators perform well even when core RDS assumptions are violated. We found that the percent question format can be asked of more respondents, yielded better results in a simulation study, and appeared to be better understood by respondents in empirical studies. The two clustering estimators perform similarly, but the GCC estimator had lower total errors than the LCC estimator in most networks we studied. However, sampling variance's contribution to RMSE drives this result, so researchers concerned about bias may prefer the LCC estimator, which we found tends to exhibit lower bias. We hope that methods for estimating clustering coefficients from RDS data will spur additional substantive and methodological contributions. Substantively, clustering is a core property that distinguishes human social networks from random graphs (Watts and Strogatz 1998). Structural hypotheses about network diffusion derived from mathematical models hold that levels of clustering influence diffusion dynamics at the network level.
For example, such models suggest that, ceteris paribus, moving from low to moderate clustering of the risk network increases transmission (Keeling and Eames 2005), but moving from moderate to high clustering does not change transmission substantially until very high levels, when the network becomes disconnected (Newman 2003). Using clustering coefficients from RDS data could allow researchers to confirm the insights of these mathematical models of network structure and disease diffusion with macro-comparative methods.[2] Second, clustering in the social network may be associated with differences in risk behaviors like unprotected sex at the individual level. Prior research finds that the interaction of an individual's network clustering with the density of contraceptive users strongly affects fertility control (Kohler, Behrman, and Watkins 2001), but that such normative reinforcement can also facilitate the spread of unhealthy behaviors (Yamanis et al. 2015). Previous studies of this topic have been limited to traditional survey populations, however, and the approaches developed in this paper will enable researchers to test these hypotheses in a more diverse series of hidden populations. In addition, estimators of network clustering can offer methodological improvements to RDS. An extension could yield additional validation of promising variants of RDS mean estimators that use exponential random graph modeling and algorithmic simulation to obtain less biased, lower variance results (Gile and Handcock 2011). Currently, these approaches model clustering as a byproduct of dyadic homophily. With empirical estimates of clustering, researchers using such algorithms could confirm the clustering coefficients produced by their algorithms.
A second contribution could allow researchers to test one of the most central but least often evaluated assumptions of RDS, that the network contains a "giant component" in which the vast majority of people are reachable through chains of arbitrary length through the network ties (Volz and Heckathorn 2008).

[2] For clarity in this example, we assume that the social network that the RDS chain traverses is a close proxy for the risk network for the disease, a connection that future research should examine more closely.

Using random graph methods from the physics and computer science traditions that generate network structures from degree distributions and clustering coefficients (Newman, Strogatz, and Watts 2001; Heath and Parikh 2011), researchers may also be able to determine if they are sampling a network with "bottlenecks," i.e., where there are few links between cohesive groups in the network, a feature which many in the RDS community link to poor estimate quality (Toledo et al. 2011). This would add to the emerging diagnostic toolkit being developed for RDS (Gile, Johnston, and Salganik 2015). A related extension of this approach could calculate the "structural risk" of a network sampled with RDS by applying percolation or other diffusion models to examine the size and speed of hypothetical epidemics spreading on the modeled network (Britton et al. 2008; Merli, Moody, Mendelsohn, et al. 2015): a potential early warning system of a given hidden population's epidemic potential gathered directly from RDS. Such extensions and future directions lie outside the scope of the present article. However, we emphasize that rather than an end point, we view this as a beginning. The benefits from estimating clustering in RDS samples are large, and we encourage researchers to begin deploying the survey questions needed for their calculation.
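The random graph constructions referenced here generate networks with specified degree and clustering by giving each node a number of ordinary edge "stubs" and triangle "stubs" and matching them at random. The sketch below is our own illustrative version of that stub-matching idea, not code from the cited papers; it collapses the multigraph that stub matching can produce by discarding self-loops and duplicate edges:

```python
import random

def random_clustered_graph(joint_degrees, rng=None):
    """Generate edges from (single_edge_degree, triangle_degree) pairs.

    Node i contributes joint_degrees[i][0] ordinary edge stubs and
    joint_degrees[i][1] triangle stubs; stubs are matched uniformly at random
    (ordinary stubs in pairs, triangle stubs in closed triples).
    """
    rng = rng or random.Random(0)
    edge_stubs, tri_stubs = [], []
    for node, (s, t) in enumerate(joint_degrees):
        edge_stubs += [node] * s
        tri_stubs += [node] * t
    assert len(edge_stubs) % 2 == 0 and len(tri_stubs) % 3 == 0
    rng.shuffle(edge_stubs)
    rng.shuffle(tri_stubs)
    edges = set()
    for i in range(0, len(edge_stubs), 2):            # pair ordinary stubs
        a, b = edge_stubs[i], edge_stubs[i + 1]
        if a != b:
            edges.add((min(a, b), max(a, b)))
    for i in range(0, len(tri_stubs), 3):             # close triangle stubs
        trio = tri_stubs[i:i + 3]
        for a, b in [(trio[0], trio[1]), (trio[1], trio[2]), (trio[0], trio[2])]:
            if a != b:
                edges.add((min(a, b), max(a, b)))
    return sorted(edges)

# Six nodes, each with one ordinary stub and one triangle stub: a third of
# stubs close triangles, so expected clustering is high for a random graph.
edges = random_clustered_graph([(1, 1)] * 6)
```

Comparing the clustering of graphs generated this way against an RDS-based clustering estimate is one route to the "bottleneck" diagnostic suggested above; networkx provides a full implementation of the same model as `random_clustered_graph`.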
In either case, further attention to RDS's ability to tell us more about hidden populations than just disease prevalence is an important next step for the literature to take.

References

Barash, Vladimir D., Christopher J. Cameron, Michael W. Spiller, and Douglas D. Heckathorn. 2016. "Respondent-Driven Sampling – Testing Assumptions: Sampling with Replacement." Journal of Official Statistics 32 (1): 29–73. doi:10.1515/jos-2016-0002.

Britton, Tom, Maria Deijfen, Andreas N. Lagerås, and Mathias Lindholm. 2008. "Epidemics on Random Graphs with Tunable Clustering." Journal of Applied Probability 45 (3): 743–56.

Centola, Damon. 2010. "The Spread of Behavior in an Online Social Network Experiment." Science 329 (5996): 1194–97. doi:10.1126/science.1185231.

Centola, Damon, and Michael Macy. 2007. "Complex Contagions and the Weakness of Long Ties." American Journal of Sociology 113 (3): 702–34.

Clouston, S. P., A. M. Verdery, Sara Amin, and G. Robin Gauthier. 2009. "The Structure of Undergraduate Association Networks: A Quantitative Ethnography." Connections 29 (2): 18–31.

Crawford, Forrest W. 2016. "The Graphical Structure of Respondent-Driven Sampling." Sociological Methodology. doi:10.1177/0081175016641713.

Crawford, Forrest W., Peter M. Aronow, Li Zeng, and Jianghong Li. 2015. "Identification of Homophily and Preferential Recruitment in Respondent-Driven Sampling." arXiv:1511.05397 [stat].

Fisher, Jacob C., and M. Giovanna Merli. 2014. "Stickiness of Respondent-Driven Sampling Recruitment Chains." Network Science 2 (2): 298–301. doi:10.1017/nws.2014.16.

Gile, Krista J. 2011. "Improved Inference for Respondent-Driven Sampling Data with Application to HIV Prevalence Estimation." Journal of the American Statistical Association 106 (493).

Gile, Krista J., and Mark S. Handcock. 2010. "Respondent-Driven Sampling: An Assessment of Current Methodology." Sociological Methodology 40 (1): 285–327.

———. 2011.
"Network Model-Assisted Inference from Respondent-Driven Sampling Data." arXiv preprint arXiv:1108.0298.

Gile, Krista J., Lisa G. Johnston, and Matthew J. Salganik. 2015. "Diagnostics for Respondent-Driven Sampling." Journal of the Royal Statistical Society: Series A (Statistics in Society) 178 (1): 241–69.

Goel, Sharad, and Matthew J. Salganik. 2009. "Respondent-Driven Sampling as Markov Chain Monte Carlo." Statistics in Medicine 28 (17): 2202–29.

———. 2010. "Assessing Respondent-Driven Sampling." Proceedings of the National Academy of Sciences 107 (15): 6743–47.

Goodman, Leo A. 1961. "Snowball Sampling." The Annals of Mathematical Statistics, 148–70.

Hardiman, Stephen J., and Liran Katzir. 2013. "Estimating Clustering Coefficients and Size of Social Networks via Random Walk." In Proceedings of the 22nd International Conference on World Wide Web, 539–50. International World Wide Web Conferences Steering Committee.

Heath, Lenwood S., and Nidhi Parikh. 2011. "Generating Random Graphs with Tunable Clustering Coefficients." Physica A: Statistical Mechanics and Its Applications 390 (23): 4577–87.

Heckathorn, Douglas D. 1997. "Respondent-Driven Sampling: A New Approach to the Study of Hidden Populations." Social Problems 44 (2): 174–99.

Karim, Q. Abdool, A. Meyer-Weitz, L. Mboyi, H. Carrara, G. Mahlase, J. A. Frohlich, and S. S. Abdool Karim. 2008. "The Influence of AIDS Stigma and Discrimination and Social Cohesion on HIV Testing and Willingness to Disclose HIV in Rural KwaZulu-Natal, South Africa." Global Public Health 3 (4): 351–65.

Keeling, Matt J., and Ken T. D. Eames. 2005. "Networks and Epidemic Models." Journal of the Royal Society Interface 2 (4): 295–307.

Klovdahl, Alden S., John J. Potterat, Donald E. Woodhouse, John B. Muth, Stephen Q. Muth, and William W. Darrow. 1994. "Social Networks and Infectious Disease: The Colorado Springs Study." Social Science & Medicine 38 (1): 79–88.

Kohler, Hans-Peter, Jere R.
Behrman, and Susan C. Watkins. 2001. "The Density of Social Networks and Fertility Decisions: Evidence from South Nyanza District, Kenya." Demography 38 (1): 43–58. doi:10.1353/dem.2001.0005.

Krivitsky, Pavel N., Mark S. Handcock, and Martina Morris. 2011. "Adjusting for Network Size and Composition Effects in Exponential-Family Random Graph Models." Statistical Methodology 8 (4): 319–39.

Lippman, Sheri A., Angela Donini, Juan Díaz, Magda Chinaglia, Arthur Reingold, and Deanna Kerrigan. 2010. "Social-Environmental Factors and Protective Sexual Behavior among Sex Workers: The Encontros Intervention in Brazil." American Journal of Public Health 100 (S1): S216–23.

Lovász, László. 1993. "Random Walks on Graphs: A Survey." Combinatorics, Paul Erdos Is Eighty 2 (1): 1–46.

Lu, Xin. 2013. "Linked Ego Networks: Improving Estimate Reliability and Validity with Respondent-Driven Sampling." Social Networks 35 (4): 669–85.

Lu, Xin, Linus Bengtsson, Tom Britton, Martin Camitz, Beom Jun Kim, Anna Thorson, and Fredrik Liljeros. 2012. "The Sensitivity of Respondent-Driven Sampling." Journal of the Royal Statistical Society: Series A (Statistics in Society) 175: 191–216. doi:10.1111/j.1467-985X.2011.00711.x.

Lu, Xin, Jens Malmros, Fredrik Liljeros, and Tom Britton. 2013. "Respondent-Driven Sampling on Directed Networks." Electronic Journal of Statistics 7: 292–322.

Malekinejad, Mohsen, Lisa Grazina Johnston, Carl Kendall, Ligia Regina Franco Sansigolo Kerr, Marina Raven Rifkin, and George W. Rutherford. 2008. "Using Respondent-Driven Sampling Methodology for HIV Biological and Behavioral Surveillance in International Settings: A Systematic Review." AIDS and Behavior 12 (1): 105–30.

Marsden, Peter V. 1987. "Core Discussion Networks of Americans." American Sociological Review, 122–31.

McCreesh, Nicky, Andrew Copas, Janet Seeley, Lisa G. Johnston, Pam Sonnenberg, Richard J. Hayes, Simon D. W. Frost, and Richard G. White. 2013.
"Respondent Driven Sampling: Determinants of Recruitment and a Method to Improve Point Estimation." PLoS One 8 (10): e78402. doi:10.1371/journal.pone.0078402.

McCreesh, Nicky, Simon Frost, Janet Seeley, Joseph Katongole, Matilda Ndagire Tarsh, Richard Ndunguse, Fatima Jichi, Natasha L. Lunel, Dermot Maher, and Lisa G. Johnston. 2012. "Evaluation of Respondent-Driven Sampling." Epidemiology (Cambridge, Mass.) 23 (1): 138.

McPherson, Miller, Lynn Smith-Lovin, and Matthew E. Brashears. 2006. "Social Isolation in America: Changes in Core Discussion Networks over Two Decades." American Sociological Review 71 (3): 353–75.

McPherson, Miller, Lynn Smith-Lovin, and James M. Cook. 2001. "Birds of a Feather: Homophily in Social Networks." Annual Review of Sociology, 415–44.

Merli, M. Giovanna, James Moody, Joshua Mendelsohn, and Robin Gauthier. 2015. "Sexual Mixing in Shanghai: Are Heterosexual Contact Patterns Compatible with an HIV/AIDS Epidemic?" Demography 52 (3): 919–42.

Merli, M. Giovanna, James Moody, Jeffrey Smith, Jing Li, Sharon Weir, and Xiangsheng Chen. 2015. "Challenges to Recruiting Population Representative Samples of Female Sex Workers in China Using Respondent Driven Sampling." Social Science & Medicine 125: 79–93.

Merli, M. Giovanna, William Whipple Neely, Tu Xiaowen, Gu Weimin, and Yang Yang. 2010. "Sampling Female Sex Workers in Shanghai Using Respondent Driven Sampling." In Rational Judgement. Public Health and Social Development, edited by Xia Guomei and Yang Xiushi, 293–308.

Milgram, Stanley. 1967. "The Small World Problem." Psychology Today 2 (1): 60–67.

Moody, James. 2002. "The Importance of Relationship Timing for Diffusion." Social Forces 81 (1): 25–56.

Moody, James, and Richard A. Benton. 2016. "Interdependent Effects of Cohesion and Concurrency for Epidemic Potential." Annals of Epidemiology 26 (4): 241–48. doi:10.1016/j.annepidem.2016.02.011.

Morris, Martina, Ann E. Kurth, Deven T.
Hamilton, James Moody, and Steve Wakefield. 2009. "Concurrent Partnerships and HIV Prevalence Disparities by Race: Linking Science and Public Health Practice." American Journal of Public Health 99 (6): 1023–31. doi:10.2105/AJPH.2008.147835.

Mouw, Ted, and Ashton M. Verdery. 2012. "Network Sampling with Memory: A Proposal for More Efficient Sampling from Social Networks." Sociological Methodology 42 (1): 206–56.

National Epidemiology Center, Department of Health, Philippines. 2014. "2013 Integrated HIV Behavioral and Serologic Surveillance (IHBSS)." http://www.aidsdatahub.org/2013-integrated-hiv-behavioral-and-serologic-surveillance-ihbss-national-epidemiology-center.

Neely, William Whipple. 2009. "Statistical Theory for Respondent-Driven Sampling." Ph.D. dissertation, The University of Wisconsin - Madison. http://search.proquest.com.libproxy.lib.unc.edu/pqdtglobal/docview/305033289/abstract/96BB2CDA89994EB2PQ/1.

Nesterko, Sergiy, and Joseph Blitzstein. 2015. "Bias-Variance and Breadth-Depth Tradeoffs in Respondent-Driven Sampling." Journal of Statistical Computation and Simulation 85 (1): 89–102. doi:10.1080/00949655.2013.804078.

Newman, Mark E. J. 2003. "Properties of Highly Clustered Networks." Physical Review E 68 (2): 26121.

Newman, Mark E. J., Steven H. Strogatz, and Duncan J. Watts. 2001. "Random Graphs with Arbitrary Degree Distributions and Their Applications." Physical Review E 64 (2): 26118.

Office of Population Research, Princeton University. 2015. http://opr.princeton.edu/archive/p90/. Accessed December 11.

Pilon, Richard, Lynne Leonard, John Kim, Dominic Vallee, Emily De Rubeis, Ann M. Jolly, John Wylie, Linda Pelude, and Paul Sandstrom. 2011. "Transmission Patterns of HIV and Hepatitis C Virus among Networks of People Who Inject Drugs." PLOS ONE 6 (7): e22245. doi:10.1371/journal.pone.0022245.

Potterat, J. J., D. E. Woodhouse, S. Q. Muth, R. Rothenberg, W. W. Darrow, A. S. Klovdahl, and J. B. Muth. 2004.
"Network Dynamism: History and Lessons of the Colorado Springs Study." In Network Epidemiology: A Handbook for Survey Design and Data Collection, edited by M. Morris, 87–114. New York: Oxford University Press.

Rhodes, Tim, and Milena Simic. 2005. "Transition and the HIV Risk Environment." BMJ 331 (7510): 220–23.

Rothenberg, Richard B., Donald E. Woodhouse, John J. Potterat, Stephen Q. Muth, William W. Darrow, and Alden S. Klovdahl. 1995. "Social Networks in Disease Transmission: The Colorado Springs Study." NIDA Research Monograph 151: 3–19.

Salganik, Matthew J., and Douglas D. Heckathorn. 2004. "Sampling and Estimation in Hidden Populations Using Respondent-Driven Sampling." Sociological Methodology 34 (1): 193–240.

Schneider, John A., Benjamin Cornwell, David Ostrow, Stuart Michaels, Phil Schumm, Edward O. Laumann, and Samuel Friedman. 2012. "Network Mixing and Network Influences Most Linked to HIV Infection and Risk Behavior in the HIV Epidemic Among Black Men Who Have Sex With Men." American Journal of Public Health 103 (1): e28–36. doi:10.2105/AJPH.2012.301003.

Silverman, Kenneth, Conrad J. Wong, Mick Needham, Karly N. Diemer, Todd Knealing, Darlene Crone-Todd, Michael Fingerhood, Paul Nuzzo, and Kenneth Kolodner. 2007. "A Randomized Trial of Employment-Based Reinforcement of Cocaine Abstinence in Injection Drug Users." Journal of Applied Behavior Analysis 40 (3): 387.

Smith, Jeffrey A. 2012. "Macrostructure from Microstructure: Generating Whole Systems from Ego Networks." Sociological Methodology 42 (1): 155–205.

Toledo, Lidiane, Cláudia T. Codeço, Neilane Bertoni, Elizabeth Albuquerque, Monica Malta, and Francisco I. Bastos. 2011. "Putting Respondent-Driven Sampling on the Map: Insights from Rio de Janeiro, Brazil." JAIDS Journal of Acquired Immune Deficiency Syndromes 57 (August): S136–43. doi:10.1097/QAI.0b013e31821e9981.

Tomas, Amber, and Krista J. Gile. 2011.
"The Effect of Differential Recruitment, Non-Response and Non-Recruitment on Estimators for Respondent-Driven Sampling." Electronic Journal of Statistics 5: 899–934.

Traud, Amanda L., Eric D. Kelsic, Peter J. Mucha, and Mason A. Porter. 2011. "Comparing Community Structure to Characteristics in Online Collegiate Social Networks." SIAM Review 53 (3): 526–43.

Traud, Amanda L., Peter J. Mucha, and Mason A. Porter. 2012. "Social Structure of Facebook Networks." Physica A: Statistical Mechanics and Its Applications 391 (16): 4165–80.

Verdery, Ashton M., M. Giovanna Merli, James Moody, Jeffrey A. Smith, and Jacob C. Fisher. 2015. "Respondent-Driven Sampling Estimators Under Real and Theoretical Recruitment Conditions of Female Sex Workers in China." Epidemiology 26 (5): 661–65.

Verdery, Ashton M., Ted Mouw, Shawn Bauldry, and Peter J. Mucha. 2016. "Network Structure and Biased Variance Estimation in Respondent Driven Sampling." PLoS ONE.

Volz, Erik, and Douglas D. Heckathorn. 2008. "Probability Based Estimation Theory for Respondent Driven Sampling." Journal of Official Statistics 24 (1): 79.

Watts, Duncan J., and Steven H. Strogatz. 1998. "Collective Dynamics of 'Small-World' Networks." Nature 393 (6684): 440–42.

Weir, Sharon S., M. Giovanna Merli, Jing Li, Anisha D. Gandhi, William W. Neely, Jessie K. Edwards, Chirayath M. Suchindran, Gail E. Henderson, and Xiang-Sheng Chen. 2012. "A Comparison of Respondent-Driven and Venue-Based Sampling of Female Sex Workers in Liuzhou, China." Sexually Transmitted Infections 88 (Suppl 2): i95–101.

Wejnert, Cyprian. 2009. "An Empirical Test of Respondent-Driven Sampling: Point Estimates, Variance, Degree Measures, and Out-of-Equilibrium Data." Sociological Methodology 39 (1): 73–116. doi:10.1111/j.1467-9531.2009.01216.x.

———. 2010. "Social Network Analysis with Respondent-Driven Sampling Data: A Study of Racial Integration on Campus." Social Networks 32 (2): 112–24.

White, Richard G., Avi J.
Hakim, Matthew J. Salganik, Michael W. Spiller, Lisa G. Johnston, Ligia Kerr, Carl Kendall, et al. 2015. “Strengthening the Reporting of Observational Studies in Epidemiology for Respondent-Driven Sampling Studies: ‘STROBE-RDS’ Statement.” Journal of Clinical Epidemiology, May. doi:10.1016/j.jclinepi.2015.04.002. White, Richard G., Amy Lansky, Sharad Goel, David Wilson, Wolfgang Hladik, Avi Hakim, and Simon D. W. Frost. 2012. “Respondent Driven Sampling—Where We Are and Where Should We Be Going?” Sexually Transmitted Infections 88 (6): 397–99. WHO. 2013. Introduction to HIV/AIDS and Sexually Transmitted Infection Surveillance: Module 4: Introduction to Respondent Driven Sampling. Geneva, Switzerland: World Health Organization. http://www.who.int/iris/handle/10665/116864. Woodhouse, Donald E., Richard B. Rothenberg, John J. Potterat, William W. Darrow, Stephen Q. Muth, Alden S. Klovdahl, Helen P. Zimmerman, Helen L. Rogers, Tammy S. Maldonado, and John B. Muth. 1994. “Mapping a Social Network of Heterosexuals at High Risk for HIV Infection.” AIDS 8 (9): 1331–36. Yamanis, Thespina J., Jacob C. Fisher, James W. Moody, and Lusajo J. Kajula. 2015. “Young Men’s Social Network Characteristics and Associations with Sexual Partnership Concurrency in Tanzania.” AIDS and Behavior 20 (6): 1244–55. doi:10.1007/s10461-015-1152-5. Yamanis, Thespina J., M. Giovanna Merli, William Whipple Neely, Felicia Feng Tian, James Moody, Xiaowen Tu, and Ersheng Gao. 2013. “An Empirical Analysis of the Impact of Recruitment Patterns on RDS Estimates among a Socially Ordered Population of Female Sex Workers in China.” Sociological Methods & Research 42 (3): 392–425.

Appendix A: Other Seed Selection Procedures

In the main text of the article we defined all of the RDS scenarios as starting from a uniform random sample of seeds.
In this appendix, we consider four alternative scenarios in the Project 90 network that vary seed selection procedures but otherwise retain all features of the “+misreporting” scenarios (we found no differences for the other RDS scenarios and do not report on them here). In these scenarios we select seeds: 1) uniformly at random from white nodes only (“+white”); 2) uniformly at random from non-white nodes only (“+non-white”); 3) with probability proportional to their level of local clustering (“+high cluster”); 4) with probability inversely proportional to their level of local clustering (“+low cluster”).

Table A1 shows the results under these alternative seed selection scenarios. We found few meaningful differences between the results provided in the main text of the manuscript and those obtained with alternative seed selection procedures. None of the biases changed direction, the largest change in the RMSEs was 0.03 (for the GCC binary estimates), and, in general, the rank ordering of estimator performance was maintained, with the percent question formats having lower RMSEs than the binary formats.

Table A1. Bias and RMSEs in the Project 90 network, by alternative seed selection scenario, estimator, and question format.

                                    Bias                             RMSE
                           GCC              LCC              GCC              LCC
Scenario              Binary  Percent  Binary  Percent  Binary  Percent  Binary  Percent
+misreporting         -0.067  -0.006    0.019  -0.015    0.076   0.034    0.057   0.045
+non-white seeds      -0.097  -0.038    0.006  -0.040    0.102   0.042    0.053   0.057
+white seeds          -0.067  -0.006    0.019  -0.016    0.076   0.033    0.057   0.045
+high cluster seeds   -0.076  -0.014    0.010  -0.016    0.085   0.033    0.056   0.045
+low cluster seeds    -0.077  -0.030    0.043  -0.037    0.085   0.038    0.067   0.057

Appendix B: Survey Questions Used in Empirical Surveys

This appendix provides the specific survey questions used in the six empirical studies reviewed in the “Applications in Empirical Surveys” section.
The Shanghai Women’s Health Study was collected in 2007 using RDS of female sex workers living in Shanghai, China (Merli et al. 2010; Yamanis et al. 2013). This study’s protocol was approved by the Research Ethics Committee of the University of Wisconsin, Madison and the Shanghai Institute of Planned Parenthood Research. This survey used a percent question format, where non-seed respondents were asked the following two questions:

Q.901. In Shanghai, how many of this kind of sex workers do you know? You know how to address them, they know how to address you, and you have met or contacted them in the past month.

Q.904. Among those people (the people in 901), how many do both you and your contact (the person who introduced you to the project) know?

We obtain the percent by dividing the answer to Q.904 by the answer to Q.901.

The RDS component of the PLACE-RDS Comparison Study sampled female sex workers in Liuzhou, China in 2010 (Weir et al. 2012). This study was approved by the Research Ethics Committee of the National Center for STD Control, China and the Institutional Review Boards at the University of North Carolina and Duke University. This survey was conducted by members of the same team as the Shanghai study, and it also used the percent format by asking two iterative questions. Non-seed respondents in this survey were asked:

Q.901. In Liuzhou city (including Liuzhou counties), how many women do you know personally who are sex workers? By sex worker, I mean that they are paid money in exchange for sex. By know personally, I mean:
- you know their name and they know yours
- you know who they are and they know you
- you have seen or contacted them in the past four weeks

Q.904. Of the (repeat response number from 901) sex workers you know, how many are also known by the person who gave you this coupon?

As above, we obtain the percentage by dividing the answers to these questions.
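The division described above is simple but benefits from explicit validity checks. A minimal sketch of this computation in Python (the function name and the rule that a shared-alter count exceeding the total count is invalid are our assumptions, not part of either survey instrument):

```python
def percent_format(q901, q904):
    """Percent question format: the share of a respondent's alters
    (Q.901) who are also known by the respondent's recruiter (Q.904).

    Returns None for responses that cannot yield a valid proportion:
    a missing answer, a non-positive total degree, or more shared
    alters than total alters reported.
    """
    if q901 is None or q904 is None:
        return None
    if q901 <= 0 or q904 < 0 or q904 > q901:
        return None
    return q904 / q901

# A respondent who knows 10 sex workers, 3 of whom also know her recruiter:
print(percent_format(10, 3))  # 0.3
```

In practice one would apply this to every non-seed respondent and treat the `None` cases as item nonresponse in downstream estimation.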
The Characterizing the Social Networks of Women and Men in Ottawa who Inject Drugs to Drive Prevention Programming Study sampled people who inject drugs in Ottawa, Canada in 2007 (Pilon et al. 2011). The Ottawa Hospital Research Ethics Board approved the study. This study asked respondents a percentage format of the question, but the approach used to collect these data differed from the format asked in the two studies of female sex workers in China that we reviewed above. Rather than asking respondents for counts of potential recruitees who know the respondent’s recruiter, trained interviewers directly asked respondents questions to elicit ego networks, and then asked them to complete an interaction grid recording contact between ego network peers. First, respondents were asked to list members they know:

Q.1.) First, please think back over the last 30 days about the people with whom you have had more than casual contact. These would be people that you have seen or have spoken to on a regular basis. Most of these close contacts would be people such as friends, family, sex partners, people you inject drugs with, or people you live with. Let’s make a list of these people starting with those who inject drugs. Please use only initials, or some other identifier that will make sense to you such as a made up name. Please do not use their last names. We will use this list to make sure we know which individuals we are talking about. Remember that we are interested in people that you’ve had contact with in the last 30 days.

Then interviewers worked with respondents to fill out an interaction grid on the basis of the following instructions:

Q3. (Following step 2, transfer the names of all the network members from the previous question onto the interaction grid – list the contacts in the ID column going down from 1-20. For each person listed, ask the subject to indicate which of the other individuals on the list that particular person knows or has contact with.
Indicate whether they know one another by placing an X in the appropriate box. You are working down through the columns, not across. E.g., if Sam is ID#1, you will go down column 1 and ask if Sam knows Tom, Mary, Mac, OT, etc. In Column 1, you will end up with an X beside each of Sam’s contacts. Next, move to column 2 and do the same for Tom, then move to column 3, column 4, etc.)

We obtained percentages by calculating the ego-network density of this matrix. We leave it for future investigation to determine whether this approach provides meaningfully different results than the percent format question recommended in the text, because implementing this interaction grid adds substantial time to the data collection process.

The fourth and fifth studies we examine come from two surveys that were part of the Integrated HIV Biological and Serological Surveillance Study fielded by researchers at the Philippines Department of Health in 2013 (National Epidemiology Center, Department of Health, Philippines 2014). Data collection was a surveillance activity and was not subject to institutional review board approval, but secondary data analysis received approval from the Institutional Review Board of the University of North Carolina at Chapel Hill. These studies surveyed people who inject drugs in Cebu City and Mandaue City, the Philippines, using a binary format of the question. Specifically, they asked respondents the following question:

1. Does the person you gave a coupon to, and your recruiter (that is, the person who gave you your coupon) know each other?

Finally, we examine early results from a sixth RDS study. The pilot survey EncuestaVeg sampled vegetarians and vegans living in La Plata, Argentina, where avoiding meat is such a rare activity as to make those who identify with the practice a hidden population. This ongoing pilot survey was begun in June 2016; we report on results obtained as of September 2016.
The protocol for this survey was approved by the Institutional Review Board of the Pennsylvania State University. In it, respondents were asked both the percent and the binary question. First, during the primary interview, non-seed respondents were asked a percent format question:

13.1. Think about all the people you know who live in the city of La Plata ages 18 and up. How many vegans and vegetarians total do you know (you know their name and they know yours)?

13.9. Think of the person who gave you the code. Of the rest of the vegans and vegetarians who you know in La Plata, how many also know the person who gave you the code?

Percentages were obtained by dividing these questions. Note that Q13.9 did not specifically reference the answer given for Q13.1, and also that the response entry was open ended. Some respondents gave larger numbers for 13.9 than they did for 13.1, while others gave string responses such as “todos [all]” or “casi todos [nearly all]”. In the main text, we report these cases as invalid responses (except “todos”, which we code as 100%).

In addition to the percent question format, recruiting participants in EncuestaVeg who returned to complete the follow-up survey were asked a series of questions about who they invited to participate and a question that allows us to calculate the binary question format. Specifically, for each person they invited, they were asked:

Q.F.18. Does this person know the person who gave you the code to answer the survey?

We use answers to this question as the binary question format.
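The EncuestaVeg coding rules for open-ended percent responses can be sketched as follows. This is an illustrative reconstruction, not the project's actual cleaning script; the function name is ours, and the rules implemented are only those stated above: “todos” is coded as 100%, while other string responses and counts for Q13.9 that exceed Q13.1 are treated as invalid:

```python
def clean_percent_response(q13_1, q13_9):
    """Code one EncuestaVeg percent-format response.

    q13_1 : total vegans/vegetarians known (int).
    q13_9 : alters who also know the recruiter; an int, or an
            open-ended string such as "todos" [all] or
            "casi todos" [nearly all].

    Returns a proportion in [0, 1], or None for invalid responses.
    """
    if isinstance(q13_9, str):
        # Only "todos" has an unambiguous numeric interpretation (100%).
        return 1.0 if q13_9.strip().lower() == "todos" else None
    if q13_1 is None or q13_9 is None:
        return None
    if q13_1 <= 0 or q13_9 < 0 or q13_9 > q13_1:
        # Q13.9 did not reference the Q13.1 answer, so counts can
        # exceed the stated degree; such cases are coded invalid.
        return None
    return q13_9 / q13_1

print(clean_percent_response(8, "todos"))       # 1.0
print(clean_percent_response(8, "casi todos"))  # None
print(clean_percent_response(8, 10))            # None
print(clean_percent_response(8, 4))             # 0.5
```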