Skip sequencing: A decision problem in questionnaire design

The Annals of Applied Statistics, 2008, Vol. 2, No. 1, 264–285. DOI: 10.1214/07-AOAS134. © Institute of Mathematical Statistics, 2008.

By Charles F. Manski¹ and Francesca Molinari²
Northwestern University and Cornell University

This paper studies questionnaire design as a formal decision problem, focusing on one element of the design process: skip sequencing. We propose that a survey planner use an explicit loss function to quantify the trade-off between cost and informativeness of the survey and aim to make a design choice that minimizes loss. We pose a choice between three options: ask all respondents about an item of interest, use skip sequencing, thereby asking the item only of respondents who give a certain answer to an opening question, or do not ask the item at all. The first option is most informative but also most costly. The use of skip sequencing reduces respondent burden and the cost of interviewing, but may spread data quality problems across survey items, thereby reducing informativeness. The last option has no cost but is completely uninformative about the item of interest. We show how the planner may choose among these three options in the presence of two inferential problems, item nonresponse and response error.

1. Introduction. Designing a questionnaire for administration to a sample of respondents requires many decisions about the items to be asked, the wording and ordering of the questions, and so on. Considerable research has investigated the item response rates and patterns associated with alternative designs. See Krosnick (1999) for a recent review of the literature. Researchers have also called attention to the tension between the desire to reduce the costs and increase the informativeness of surveys. See, for example, Groves (1987) and Groves and Heeringa (2006).
However, survey researchers have not studied questionnaire design as a formal decision problem in which one uses an explicit loss function to quantify the trade-off between cost and informativeness and aims to make a design choice that minimizes loss. This paper takes an initial step in that direction. We consider one element of the design problem, the use of skip sequencing.

Received July 2007; revised September 2007.
¹ Supported in part by National Institute of Aging Grants R21 AG028465-01 and 5P01AG026571-02, and by NSF Grant SES-05-49544.
² Supported in part by National Institute of Aging Grant R21 AG028465-01 and by NSF Grant SES-06-17482.
Key words and phrases. Skip sequencing, questionnaire design, item nonresponse, response error, partial identification.
This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Applied Statistics, 2008, Vol. 2, No. 1, 264–285. This reprint differs from the original in pagination and typographic detail.

Skip sequencing is a widespread survey practice in which the response to an opening question is used to determine whether a respondent should be asked certain subsequent questions. The objective is to eliminate inapplicable questions, thereby reducing respondent burden and the cost of interviewing. However, skip sequencing can amplify data quality problems. In particular, skip sequencing exacerbates the identification problems caused by item nonresponse and response errors.

A respondent may not answer the opening question. When this happens, a common practice is to label the subsequent questions as inapplicable. However, they may be applicable, in which case the item nonresponse problem is amplified.
Another practice is to impute the answer to the opening question and, if the imputation is positive, to also impute answers to the subsequent questions. Some of these imputations will inevitably be incorrect. A particularly odd situation occurs when the answer to the opening question should be negative but the imputation is positive. Then answers are imputed to subsequent questions that actually are inapplicable.

A respondent may answer the opening question with error. An error may cause subsequent questions to be skipped, when they should be asked, or vice versa. An error of the first type induces nonresponse to the subsequent questions. The consequences of an error of the second type depend on how the respondent answers the subsequent questions, having answered the opening one incorrectly.

Illustration 1. The 2006 wave of the Health and Retirement Study (HRS) asked current Social Security recipients about their expectations for the future of the Social Security system. An opening question asked broadly: “Thinking of the Social Security program in general and not just your own Social Security benefits: On a scale from 0 to 100 (where 0 means no chance and 100 means absolutely certain), what is the percent chance that Congress will change Social Security sometime in the next 10 years, so that it becomes less generous than now?” If the answer was a number greater than zero, a follow-up question asked “We just asked you about changes to Social Security in general. Now we would like to know whether you think these Social Security changes might affect your own benefits.
On a scale from 0 to 100, what do you think is the percent chance that the benefits you yourself are receiving from Social Security will be cut some time over the next 10 years?” If a person did not respond to the opening question or gave an answer of 0, the follow-up question was not asked.

Illustration 2. The 1990 wave of the National Longitudinal Survey of Older Men (NLSOM) queried respondents about their limitations in activities of daily living (ADLs). An opening question asked broadly: “Because of a health or physical problem, do you ever need help from anyone in looking after personal care such as dressing, bathing, eating, going to the bathroom, or other such daily activities?” If the answer was positive, the respondent was then asked if he/she receives help from another person in each of six specific ADLs (bathing/showering, dressing, eating, getting in or out of a chair or bed, walking, using the toilet). If the answer was negative or missing, the subsequent questions were skipped out.

These illustrative uses of skip sequencing save survey costs by asking a broad question first and by following up with a more specific question only when the answer to the broad question meets specified criteria. However, nonresponse or response error to the opening question may compromise the quality of the data obtained.

This paper studies skip sequencing as a decision problem in questionnaire design. We suppose that a survey planner is considering whether and how to ask about an item of interest. Three design options follow:

Option All (A): ask all respondents the question.
Option Skip (S): ask only those respondents who respond positively to an opening question.
Option None (N): do not ask the question at all.

These options vary in the cost of administering the questions and in the informativeness of the data they yield.
Option (A) is most costly and is potentially most informative. Option (S) is less costly but may be less informative if the opening question has nonresponse or response errors. Option (N) has no cost but is uninformative about the item of interest. We suppose that the planner must choose among these options, weighing cost and informativeness as he deems appropriate. We suggest an approach to this decision problem and give illustrative applications.

The paper is organized as follows. As a prelude, Section 2 summarizes the few precedent studies that consider the data quality aspects of skip sequencing. These studies do not analyze skip sequencing as a decision problem.

Section 3 formalizes the problem of choice among design options. We assume that the survey planner wants to minimize a loss function whose value depends on the cost of a design option and its informativeness. Thus, evaluation of the design options requires that the planner measure their cost and informativeness.

Suppose that a planner wants to combine sample data on an item with specified assumptions in order to learn about a population parameter of interest. When the sample size is large, we propose that informativeness be measured by the size of the identification region that a design option yields for this parameter. As explained in Manski (2003), the identification region for the parameter is the set of values that remain feasible when unlimited observations from the sampling process are combined with the maintained assumptions. The parameter is point-identified when this set contains a single value and is partially identified when the set is smaller than the parameter's logical range, but is not a single point.
In survey settings with large samples of respondents, where identification rather than statistical inference is the dominant inferential problem, we think it natural to measure informativeness by the size of the identification region. The smaller the identification region, the better. Section 6 discusses measurement of informativeness when the sample size is small. Then confidence intervals for the partially identified parameter may be used to measure informativeness.

Sections 4 and 5 apply the general ideas of Section 3 in two polar settings having distinct inferential problems. Section 4 studies cases in which there may be nonresponse to the questions posed but it is assumed that there are no response errors. We first derive the identification regions under options A, S and N. We then show the circumstances in which a survey planner should choose each option. To illustrate, we consider choice among options for querying respondents about their expectations for future personal Social Security benefits. The HRS 2006 used skip sequencing, as described in Illustration 1. Another option would be to ask all respondents both the broad and the personal question. A third option would be to ask only the broad question, omitting the one about future personal benefits.

Section 5 studies the other polar setting in which there is full response but there may be response errors. Again, we first derive the identification regions under the three design options and then show when a survey planner should choose each option. To illustrate, we consider choice among options for querying respondents about limitations in ADLs. The NLSOM used skip sequencing, as described in Illustration 2. Another survey, the 1993 wave of the Assets and Health Dynamics Among the Oldest Old (AHEAD) asked all respondents about a set of specific ADLs.
A third option would be to not ask about specific ADLs at all. Section 6 concludes by calling for further analysis of questionnaire design as a decision problem.

2. Previous studies of skip sequencing. As far as we are aware, there has been no precedent research studying skip sequencing as a decision problem in questionnaire design. Messmer and Seymour (1982) and Hill (1991, 1993) are the only precedent studies recognizing that skip sequencing may amplify data quality problems.

Messmer and Seymour studied the effect of skip sequencing on item nonresponse in a large scale mail survey. Their analysis asked whether the difficult structure of the survey, particularly the fact that respondents were instructed to skip to other questions perhaps several pages away in the questionnaire, increased the number of unanswered questions. Their analysis indicates that branching instructions significantly increased the rate of item nonresponse for questions following a branch, and that this effect was higher for older individuals. This work is interesting but it does not have direct implications for modern surveys, where skip sequencing is automated rather than performed manually.

Hill used data from five interview/reinterview sequence pairs in the 1984 Survey of Income and Program Participation (SIPP) Reinterview Program. He examined data errors that manifest themselves through a discrepancy between the responses given in the two interviews, and categorized these discrepancies in three groups. In his terminology, a response discrepancy occurs when a different answer is recorded for an opening question in the interview and in the reinterview. A response induced sequencing discrepancy occurs when, as a consequence of different answers to the opening question, a subsequent question is asked in only one of the two interviews. A procedurally induced sequencing discrepancy occurs when, in one of the two interviews but not both, an opening question is not asked and, therefore, the subsequent question is not asked either.

Hill used a discrete contagious regression model to assess the relative importance of these errors in reducing data quality. The contagion process was used to express the idea that error spreads from one question to the next via skip sequencing. Within this model, the “conditional population at risk of contagion” expresses the idea that the number of remaining questions in the sequence at the point where the initiating error occurs gives an upper bound on the number of errors that can be induced. Hill's results suggest that the losses of data reliability caused by induced sequencing errors are at least as large as those induced by response errors. Moreover, the relative importance of sequencing errors strongly increases with the sequence length. This suggests that the reliability of individual items will be lower, all else equal, the later they appear in the sequence.

3. A formal design problem.

3.1. The choice setting. We pose here a formal questionnaire design problem that highlights how skip sequencing may affect data quality. To focus on this matter, we find it helpful to simplify the choice setting in three major respects.

First, we suppose that a large random sample of respondents is drawn from a much larger population. This brings identification to the fore as the dominant inferential problem, the statistical precision of sample estimates receding into the background as a minor concern. We also suppose that all sample members agree to be interviewed. Hence, inferential problems arise only from item nonresponse and response errors, not from interview nonresponse.
Second, we perform a “marginalist” analysis that supposes the entire design of the questionnaire has been set except for one item. The only decision is whether and how to ask about this item. Marginalist analysis enormously simplifies the decision problem. In practice, a survey planner must choose the entire structure of the questionnaire, and the choice made about one item may interact with choices made about others. We recognize this but, nevertheless, find it useful for exposition to focus on a single aspect of the global design problem, holding fixed the remainder of the questionnaire.

Third, we assume that the design chosen for the specific item in our marginalist analysis affects only the informativeness of that item. In practice, the choice of how to ask a specific item affects the length of the entire survey, which may influence respondents' willingness or ability to provide reliable responses to other items. We recognize this but, nevertheless, find it useful for exposition to suppose that the effect on other items is negligible.

Let $y$ denote the item under consideration. As indicated in the Introduction, the design options are as follows:

A: ask all respondents to report $y$.
S: ask only those respondents who respond positively to an opening question.
N: do not ask about $y$ at all.

The population parameter of interest is labeled $\tau[P(y)]$, where $P$ is the population distribution of $y$. For example, $\tau[P(y)]$ might be the population mean or median value of $y$.

3.2. Measuring the cost, informativeness, and loss of the design options. The design options differ in their costs and in their informativeness about $\tau[P(y)]$. Abstractly, let $c_k$ denote the cost of option $k$, let $d_k$ denote its informativeness, and let $L_k = L(c_k, d_k)$ be the loss that the survey planner associates with option $k$.
We suppose that the planner wants to choose a design option that minimizes $L(c_k, d_k)$ over $k \in \{A, S, N\}$.

To operationalize this abstract optimization problem, a survey planner must decide how to measure loss, cost, and informativeness. Loss presumably increases with cost and decreases with informativeness. We will not be more specific about the form of the loss function here. We will, for simplicity, use a linear form in our applications.

Cost presumably increases with the fraction of respondents who are asked the item. In some settings, cost may be proportional to this fraction. Then $c_k = \gamma f_k$, where $\gamma > 0$ is the cost per respondent of data collection and $f_k$ is the fraction of respondents asked the item under option $k$. It is the case that $1 = f_A \ge f_S \ge f_N = 0$. Hence, $c_A = \gamma$, $c_S = \gamma f_S$, $c_N = 0$.

As indicated in the Introduction, we propose measurement of the informativeness of a design option by the size of the identification region obtained for the parameter of interest. In general, the size of an identification region depends on the specified parameter, the data produced by a design option, and the assumptions that the planner is willing to maintain. Sections 4 and 5 show how in some leading cases.

4. Question design with nonresponse. This section examines how nonresponse affects choice among the three design options. To focus attention on the inferential problem created by nonresponse, we assume that when sample members do respond, all answers are accurate. Section 4.1 considers identification of the parameter $\tau[P(y)]$. Section 4.2 shows how to use the findings to choose a design. Section 4.3 uses questions on future generosity of Social Security to illustrate.

4.1. Identification with nonresponse. It has been common in survey research to impute missing values and to use these imputations as if they are real data.
Standard imputation methods presume that data are missing at random (MAR), conditional on specified observable covariates; see Little and Rubin (1987). If the maintained MAR assumptions are correct, then parameter $\tau[P(y)]$ is point-identified under both of design options A and S. Option S is less costly, so there is no reason to contemplate option A from the perspective of identification. If option A is used in practice, the reason must be to provide a larger sample of observations in order to improve statistical inference.

Identification becomes the dominant concern when, as is often the case, a survey planner has only a weak understanding of the distribution of missing data. We focus here on the worst-case setting, in which the planner knows nothing at all about the missing data. It is straightforward to determine the identification region for $\tau[P(y)]$ under design options A and S. We draw on Manski [(2003), Chapter 1] to show how.

Option A. To formalize the identification problem created by nonresponse, let each member $j$ of a population $J$ have an outcome $y_j$ in a space $Y \equiv [0, s]$. Here $s$ can be finite or can equal $\infty$, in which case $Y$ is the nonnegative part of the extended real line. The assumption that $y$ is nonnegative is not crucial for our analysis, but it simplifies the exposition and notation. The population is a probability space and $y : J \to Y$ is a random variable with distribution $P(y)$. Let a sampling process draw persons at random from $J$. However, not all realizations of $y$ are observable. Let the realization of a binary random variable $z_y^A$ indicate observability; $y$ is observable if $z_y^A = 1$ and not observable if $z_y^A = 0$. The superscript $A$ shows the dependence of observability of $y$ on design option A.

By the Law of Total Probability,

$$P(y) = P(y \mid z_y^A = 1)\,P(z_y^A = 1) + P(y \mid z_y^A = 0)\,P(z_y^A = 0). \tag{1}$$

The sampling process reveals $P(y \mid z_y^A = 1)$ and $P(z_y^A)$, but it is uninformative regarding $P(y \mid z_y^A = 0)$. Hence, the sampling process partially identifies $P(y)$. In particular, it reveals that $P(y)$ lies in the identification region

$$H^A[P(y)] \equiv \bigl[P(y \mid z_y^A = 1)\,P(z_y^A = 1) + \psi\,P(z_y^A = 0),\ \psi \in \Psi_Y\bigr]. \tag{2}$$

Here $\Psi_Y$ is the space of all probability distributions on $Y$ and the superscript $A$ on $H$ shows the dependence of the identification region on the design option.

The identification region for a parameter of $P(y)$ follows immediately from $H^A[P(y)]$. Consider inference on the parameter $\tau[P(y)]$. The identification region consists of all possible values of the parameter. Thus,

$$H^A\{\tau[P(y)]\} \equiv \{\tau(\eta),\ \eta \in H^A[P(y)]\}. \tag{3}$$

Result (3) is simple but is too abstract to be useful as stated. Research on partial identification has sought to characterize $H^A\{\tau[P(y)]\}$ for different parameters. Manski (1989) does this for means of bounded functions of $y$, Manski (1994) for quantiles, and Manski [(2003), Chapter 1] for all parameters that respect first-order stochastic dominance. Blundell et al. (2007) and Stoye (2005) characterize the identification regions for spread parameters such as the variance, interquartile range and the Gini coefficient.

The results for means of bounded functions are easy to derive and instructive, so we focus on these parameters here. To further simplify the exposition, we restrict attention to monotone functions. Let $\Re$ be the extended real line. Let $g(\cdot)$ be a monotone function that maps $Y$ into $\Re$ and that attains finite lower and upper bounds $g_0 \equiv \min_{y \in Y} g(y) = g(0)$ and $g_1 \equiv \max_{y \in Y} g(y)$.
Without loss of generality, by a normalization, we set $g_0 = 0$ and $g_1 = 1$. The problem of interest is to infer $E[g(y)]$. The Law of Iterated Expectations gives

$$E[g(y)] = E[g(y) \mid z_y^A = 1]\,P(z_y^A = 1) + E[g(y) \mid z_y^A = 0]\,P(z_y^A = 0). \tag{4}$$

The sampling process reveals $E[g(y) \mid z_y^A = 1]$ and $P(z_y^A)$, but it is uninformative regarding $E[g(y) \mid z_y^A = 0]$, which can take any value in the interval $[0, 1]$. Hence, the identification region for $E[g(y)]$ is the closed interval

$$H^A\{E[g(y)]\} = \bigl[E[g(y) \mid z_y^A = 1]\,P(z_y^A = 1),\ E[g(y) \mid z_y^A = 1]\,P(z_y^A = 1) + P(z_y^A = 0)\bigr]. \tag{5}$$

$H^A\{E[g(y)]\}$ is a proper subset of $[0, 1]$ whenever $P(z_y^A = 0)$ is less than one. The width of the region is $P(z_y^A = 0)$. Thus, the severity of the identification problem varies directly with the prevalence of missing data.

Option S. There are two sources of nonresponse under option S. First, a sample member may not respond to the opening question, in which case she is not asked about item $y$. Second, a sample member may respond to the opening question but not to the subsequent question about item $y$.

Let $x$ denote the item whose value is sought in the opening question. As in Illustrations 1 and 2, we suppose that $x$ is a broad item and that $y$ is a more specific one. For simplicity, we suppose here that $x \in \{0, 1\}$ and that $x = 0 \Rightarrow y = 0$. A respondent is asked about $y$ only if she answers the opening question and reports $x = 1$. For example, consider Illustration 2 discussed in the Introduction. If a respondent does not have any limitation in ADLs ($x = 0$), then clearly the respondent does not have a limitation in bathing/showering ($y = 0$). Hence, the NLSOM asks about $y$ only when a respondent reports $x = 1$.
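As a rough illustration of the Option A interval (5) derived above, the following Python sketch computes the sample analog of the worst-case bounds on $E[g(y)]$. The function name `worst_case_bounds`, the use of `None` to mark nonresponse, and the toy data are assumptions made for this example only; the paper itself works with population quantities, so this is a finite-sample stand-in under the stated normalization $g(y) \in [0, 1]$, not the authors' procedure.

```python
from typing import Optional, Sequence, Tuple


def worst_case_bounds(g_values: Sequence[Optional[float]]) -> Tuple[float, float]:
    """Sample analog of interval (5): bounds on E[g(y)] with g(y) in [0, 1].

    Each entry is g(y) for one sample member, or None if the item is missing.
    """
    n = len(g_values)
    if n == 0:
        raise ValueError("need at least one sample member")
    observed = [v for v in g_values if v is not None]
    if any(not 0.0 <= v <= 1.0 for v in observed):
        raise ValueError("g(y) must be normalized to lie in [0, 1]")
    p_missing = (n - len(observed)) / n
    # E[g(y) | z = 1] * P(z = 1) is the sum of observed values divided by n.
    lower = sum(observed) / n
    # Worst case on the other side: every missing value of g(y) equals 1.
    upper = lower + p_missing
    return lower, upper


# Hypothetical data: five sample members, two of whom did not respond.
low, high = worst_case_bounds([0.2, 0.6, None, 1.0, None])
print(f"bounds: [{low:.2f}, {high:.2f}], width: {high - low:.2f}")
```

The width of the returned interval equals the sample nonresponse rate, which mirrors the statement that the severity of the identification problem varies directly with the prevalence of missing data.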
To formalize the identification problem, we need two response indicators, $z_x^S$ and $z_y^S$, the superscript $S$ showing the dependence of nonresponse on design option S. Let $z_x^S = 1$ if a respondent answers the opening question and let $z_x^S = 0$ otherwise. Let $z_y^S = 1$ if a respondent who is asked the follow-up question gives a response, with $z_y^S = 0$ otherwise. Hence, $z_y^S = 1 \Rightarrow z_x^S = 1$. This and the Law of Iterated Expectations and the fact that $g(0) = 0$ give

$$\begin{aligned}
E[g(y)] &= E[g(y) \mid x = 1]\,P(x = 1) + E[g(y) \mid x = 0]\,P(x = 0) \\
&= E[g(y) \mid x = 1, z_y^S = 1]\,P(z_y^S = 1, x = 1) \\
&\quad + E[g(y) \mid x = 1, z_x^S = 1, z_y^S = 0]\,P(z_x^S = 1, z_y^S = 0, x = 1) \\
&\quad + E[g(y) \mid x = 1, z_x^S = 0]\,P(z_x^S = 0, x = 1).
\end{aligned}$$

The sampling process reveals $E[g(y) \mid x = 1, z_y^S = 1]$, $P(z_x^S = 1, z_y^S = 0, x = 1)$, and $P(z_y^S = 1) = P(z_y^S = 1, x = 1)$, where the last equality holds because $z_y^S = 1 \Rightarrow x = 1$. The data are uninformative about $E[g(y) \mid x = 1, z_x^S = 1, z_y^S = 0]$ and $E[g(y) \mid x = 1, z_x^S = 0]$, which can take any values in $[0, 1]$. The data are partially informative about $P(z_x^S = 0, x = 1)$, which can take any value in $[0, P(z_x^S = 0)]$. It follows that the identification region for $E[g(y)]$ is the closed interval

$$H^S\{E[g(y)]\} = \bigl[E[g(y) \mid z_y^S = 1]\,P(z_y^S = 1),\ E[g(y) \mid z_y^S = 1]\,P(z_y^S = 1) + P(z_x^S = 1, z_y^S = 0, x = 1) + P(z_x^S = 0)\bigr]. \tag{6}$$

Thus, the severity of the identification problem varies directly with the prevalence of nonresponse to the opening question and to the follow-up question in the subpopulation in which it is asked.

4.2. Choosing a design. Now consider choice among the three design options (A, S, N).
The widths of the identification regions for $E[g(y)]$ under these options are as follows:

$$d_A = P(z_y^A = 0), \qquad d_S = P(z_x^S = 1, z_y^S = 0, x = 1) + P(z_x^S = 0), \qquad d_N = 1.$$

For specificity, let the loss function have the linear form $L_k = \gamma f_k + d_k$. The first component measures survey cost and the second measures the informativeness of the design option, with a wider region (larger $d_k$) being less informative. We set the coefficient on $d_k$ equal to one as a normalization of scale. The parameter $\gamma$ measures the importance that the survey planner gives to cost relative to informativeness. There is no universally “correct” value of this parameter. Its value is something that the survey planner must specify, depending on the survey context and the nature of item $y$.

It follows from the above and from the derivations of Section 4.1 that the losses associated with the three design options are as follows:

$$L_A = \gamma + P(z_y^A = 0), \qquad L_S = \gamma\,P(z_x^S = 1, x = 1) + P(z_x^S = 1, z_y^S = 0, x = 1) + P(z_x^S = 0), \qquad L_N = 1.$$

Thus, it is optimal to administer item $y$ to all sample members if

$$\gamma + P(z_y^A = 0) \le \min\bigl\{1,\ \gamma\,P(z_x^S = 1, x = 1) + P(z_x^S = 1, z_y^S = 0, x = 1) + P(z_x^S = 0)\bigr\}.$$

Skip sequencing is optimal if

$$\gamma\,P(z_x^S = 1, x = 1) + P(z_x^S = 1, z_y^S = 0, x = 1) + P(z_x^S = 0) \le \min\bigl\{1,\ \gamma + P(z_y^A = 0)\bigr\}.$$

If neither of these inequalities holds, it is optimal not to ask the item at all.

Determination of the optimal design option requires knowledge of the response rates that would occur under options A and S. This is where the body of survey research reviewed by Krosnick (1999) has a potentially important role to play. Through the use of randomized experiments embedded in surveys, researchers have developed considerable knowledge of the response rates that occur when various types of questions are posed to diverse populations.
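The decision rule of this section can be sketched in a few lines of Python. The function names `design_losses` and `best_option` and the trial values of $\gamma$ are hypothetical choices for illustration. The response-rate inputs are the values reported for the HRS illustration in Section 4.3, including the authors' conjectured nonresponse probability of 0.08 under option A, so the sketch reproduces that example rather than a general-purpose tool.

```python
from typing import Dict


def design_losses(gamma: float, d_a: float, f_s: float, d_s: float) -> Dict[str, float]:
    """Linear losses L_k = gamma * f_k + d_k for options A, S and N.

    d_a: P(item nonresponse) under option A (width of the region under A)
    f_s: fraction of respondents asked the item under option S
    d_s: width of the identification region under option S
    """
    return {
        "A": gamma * 1.0 + d_a,  # f_A = 1: everyone is asked
        "S": gamma * f_s + d_s,
        "N": 1.0,                # f_N = 0 and d_N = 1
    }


def best_option(gamma: float, d_a: float, f_s: float, d_s: float) -> str:
    """Return the loss-minimizing option; ties are broken in the order A, S, N."""
    loss = design_losses(gamma, d_a, f_s, d_s)
    return min(("A", "S", "N"), key=lambda k: loss[k])


# Inputs taken from the HRS illustration in Section 4.3 of the text.
D_A = 0.08             # conjectured P(z_y^A = 0)
F_S = 0.8705           # P(z_x^S = 1, x = 1)
D_S = 0.0197 + 0.0723  # P(z_x^S = 1, z_y^S = 0, x = 1) + P(z_x^S = 0)

for gamma in (0.05, 0.5, 1.2):  # hypothetical cost weights
    print(gamma, best_option(gamma, D_A, F_S, D_S))
```

With these inputs the three trial cost weights should select options A, S and N respectively, in line with the thresholds of roughly 0.0927 and 1.0431 reported in Section 4.3.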
In many cases, this body of knowledge can be brought to bear to provide credible values for the response rates that determine loss under options A and S. When the literature does not provide credible values for these response rates, a survey planner may want to perform his own pretest, randomly assigning sample members to options A and S. The size of the pretest sample only needs to be large enough to determine with reasonable confidence which design option is best. It does not need to be large enough to give precise estimates of the response rates.

4.3. Questioning about expectations on the generosity of social security. Consider the questions on expectations for the future generosity of the Social Security program cited in Illustration 1. The opening question was posed to 10,748 respondents to the 2006 HRS who currently receive social security benefits, and the follow-up was asked to the sub-sample of 9356 persons who answered the opening question and gave a response greater than zero. We assume here that the only data problem is nonresponse. The nonresponse rate to the opening question was 7.23%. The nonresponse rate to the follow-up question, for the subsample asked this question, was 2.27%.

It is plausible that someone may not be willing to respond to the first question and yet be willing to respond to the second one. In particular, this would happen if a person does not want to speculate on what Congress will do but, nevertheless, is sure that if Congress does act, it would only change benefits for future retirees, not for those already in the system. The HRS use of skip sequencing prevents observation of $y$ in such cases.

To cast this application into the notation of the previous section, we let $x = 1$ if a respondent places a positive probability on Congress acting, with $x = 0$ otherwise.
The rest of the notation is the same as above. An early release of the HRS data provides these empirical values for the quantities that determine the identification region for $E[g(y)]$ and loss under design option S:

$$P(z_x^S = 1, z_y^S = 0, x = 1) = 0.0197, \quad P(z_x^S = 1, x = 1) = 0.8705, \quad P(z_y^S = 1) = 0.8508, \quad P(z_x^S = 0) = 0.0723, \quad E[g(y) \mid z_y^S = 1] = 0.4039,$$

where $g(y) \equiv y/100$. Hence, the identification region for $E[g(y)]$ under option S is $H^S\{E[g(y)]\} = [0.3436, 0.4356]$ and loss is $L_S = 0.8705\gamma + 0.0920$. Here the lower bound is $0.4039 \times 0.8508 = 0.3436$ and the width is $0.0197 + 0.0723 = 0.0920$.

The HRS data do not reveal the quantities that determine the identification region for $E[g(y)]$ and loss under design option A. For this illustration, we conjecture that the mean response to item $y$ that would be obtained under option A equals the mean response that is observed under option S. Thus, $E[g(y) \mid z_y^A = 1] = 0.4039$. We suppose further that the nonresponse probability would be $P(z_y^A = 0) = 0.08$. Then the identification region for $E[g(y)]$ under option A is $H^A\{E[g(y)]\} = [0.3716, 0.4516]$ and loss is $L_A = \gamma + 0.08$.

It follows from the above that it is optimal to administer item $y$ to all sample members if

$$\gamma \le 0.0927.$$

Skip sequencing is optimal if

$$0.0927 \le \gamma \le 1.0431.$$

If neither of these inequalities holds, it is optimal not to ask the item at all.

5. Question design with data errors. This section examines how response errors affect choice among the three design options. To focus attention on the inferential problem created by such errors, we assume that all sample members respond to the questions posed. Section 5.1 considers identification. Section 5.2 shows how to use the findings to choose a design. Section 5.3 uses questions on limitations in ADLs to illustrate.

5.1. Identification with response errors.
Section 4 showed that assumptions about the distribution of missing data are unnecessary for partially informative inference in the presence of nonresponse. In contrast, assumptions on the nature or prevalence of response errors are a prerequisite for inference. In cases where $y$ is discrete, it is natural to think of data errors as classification errors. We conceptualize response error here through a misclassification model previously used by Molinari (2003, 2008), and we draw on her findings. The Appendix discusses the mixture model of data errors, which yields equivalent results beginning from a different conceptualization of data errors.

The misclassification model is a simple formalism that does not have content per se. It becomes informative when it is combined with an assumed upper bound on the prevalence of data errors. When such a bound is available, Molinari (2003) showed that $E[g(y)]$ is partially identified under design option A. It is straightforward to show the same under option S. To simplify the exposition, we focus here on the particularly simple case where $y \in \{0, 1\}$ and $g(y) \equiv y$. Corresponding results for general discrete $Y$ and any bounded function $g(\cdot) : Y \to [0, 1]$ may be obtained from the authors.

Option A. As in Section 4, let each member $j$ of a population $J$ have an outcome $y_j$ and let $P(y)$ be the population distribution of $y$. Let a sampling process draw persons at random from $J$. Let $\tilde y : J \to Y$ denote the responses that population members would give when queried about $y$. The researcher observes realizations of $\tilde y$, which can either equal or differ from the corresponding realizations of $y$. When $\tilde y \ne y$, data errors occur.
The misclassification model begins with the basic observation that, by the Law of Total Probability,

$$\begin{pmatrix} P(\tilde y^A = 1) \\ P(\tilde y^A = 0) \end{pmatrix} = \begin{pmatrix} P(\tilde y^A = 1 \mid y = 1) & P(\tilde y^A = 1 \mid y = 0) \\ P(\tilde y^A = 0 \mid y = 1) & P(\tilde y^A = 0 \mid y = 0) \end{pmatrix} \begin{pmatrix} P(y = 1) \\ P(y = 0) \end{pmatrix}.$$

The superscript A shows that the response $\tilde y^A$ depends on design option A. The sampling process reveals only $P(\tilde y^A)$, which per se is uninformative about $P(y)$. The basic maintained assumption is a known nontrivial lower bound $1 - \lambda_A > 0$ on the probability that the realizations of $\tilde y^A$ and $y$ coincide, or, strengthening this assumption, a known nontrivial lower bound on the probability of correct report for each value that $y$ can take. Formally, these assumptions are as follows:

Assumption 1. $P(y = \tilde y^A) \ge 1 - \lambda_A > 0$.

Assumption 2. $P(\tilde y^A = k \mid y = k) \ge 1 - \lambda_A > 0$, $\forall k \in Y$.

Molinari (2003) shows that, under Assumption 1,

$$H_A[P(y = 1)] = [0, 1] \cap [P(\tilde y^A = 1) - \lambda_A,\; P(\tilde y^A = 1) + \lambda_A], \tag{7}$$

while, under Assumption 2,

$$H_A[P(y = 1)] = [0, 1] \cap \left[\frac{P(\tilde y^A = 1) - \lambda_A}{1 - \lambda_A},\; \frac{P(\tilde y^A = 1)}{1 - \lambda_A}\right]. \tag{8}$$

Observe that these identification regions yield informative lower and upper bounds on $P(y = 1)$ when $\lambda_A \le P(\tilde y^A = 1) \le 1 - \lambda_A$.

Results (7) and (8) were derived earlier by Horowitz and Manski (1995), using a different formalization of data errors. They studied partial identification of probability distributions under the mixture model of data errors used in studies of robust inference following Huber (1964). Their main assumption was the availability of an upper bound on the prevalence of data errors as defined in the mixture model, just as Huber assumed in his seminal research. See the Appendix for further discussion of the relationship between the mixture model and the misclassification model.

Option S.
There are two sources of potential response error under option S. First, a sample member may respond with error to the opening question. Then she is erroneously not asked the follow-up question if she gives a false negative answer, and she is erroneously asked the follow-up question if she gives a false positive answer. Second, a sample member may (truthfully) respond affirmatively to the opening question and then respond with error to the follow-up.

As in Section 4, we let $y$ denote the true value of the variable of interest and $x$ denote the true value of the variable elicited in the opening question. The error-ridden versions of these variables are $\tilde y^S$ and $\tilde x^S$, respectively. As in Section 4, skip sequencing has certain logical implications when the opening question inquires broadly about a subject and the follow-up inquires more specifically. These logical relations are $x = 0 \Rightarrow y = 0$ and $\tilde x^S = 0 \Rightarrow \tilde y^S = 0$.

The misclassification model begins with the observation that, by the Law of Total Probability,

$$P(\tilde x^S = i, \tilde y^S = k) = \sum_{l = 0, 1} \sum_{m = 0, 1} P(\tilde x^S = i, \tilde y^S = k \mid x = l, y = m)\, P(x = l, y = m), \qquad i, k \in \{0, 1\}.$$

The sampling process reveals only the quantities $P(\tilde x^S = i, \tilde y^S = k)$ on the left-hand side of these equations, with the logic of skip sequencing implying that $P(\tilde x^S = 1, \tilde y^S = 1) = P(\tilde y^S = 1)$, $P(\tilde x^S = 0, \tilde y^S = 0) = P(\tilde x^S = 0)$ and $P(\tilde x^S = 0, \tilde y^S = 1) = 0$. The logic of skip sequencing also implies that $P(x = 1, y = 1) = P(y = 1)$, $P(x = 0, y = 0) = P(x = 0)$ and $P(x = 0, y = 1) = 0$.

The observable quantities and logical restrictions per se are uninformative about $P(y)$, but they become informative when combined with these extensions of Assumptions 1 and 2:

Assumption 3. $P(x = \tilde x^S, y = \tilde y^S) \ge 1 - \lambda_S > 0$.

Assumption 4.
$P(\tilde x^S = i, \tilde y^S = k \mid x = i, y = k) \ge 1 - \lambda_S > 0$, $i, k \in \{0, 1\}$, $k \le i$.

Extension of the argument of Molinari (2003) shows that, under Assumption 3,

$$H_S[P(y = 1)] = [0, 1] \cap [P(\tilde y^S = 1) - \lambda_S,\; P(\tilde y^S = 1) + \lambda_S], \tag{9}$$

while, under Assumption 4,

$$H_S[P(y = 1)] = [0, 1] \cap \left[\frac{P(\tilde y^S = 1) - \lambda_S}{1 - \lambda_S},\; \frac{P(\tilde y^S = 1)}{1 - \lambda_S}\right]. \tag{10}$$

These identification regions yield informative lower and upper bounds on $P(y = 1)$ when $\lambda_S \le P(\tilde y^S = 1) \le 1 - \lambda_S$.

Whereas Assumptions 1 and 2 only concerned the coincidence of the true and reported values of $y$, Assumptions 3 and 4 concern the joint coincidence of the true and reported values of $(x, y)$. Hence, it is reasonable to think that a survey planner will ordinarily specify a higher lower bound in the first case than the second; that is, $1 - \lambda_A > 1 - \lambda_S$.

5.2. Choosing a design. Now consider choice among the three design options. The width of the identification region for $P(y = 1)$ under option N remains $d_N = 1$, and therefore, the loss associated with this option is $L_N = 1$. For simplicity, we focus here on the case when the identification regions under options A and S yield informative lower and upper bounds; that is, $\lambda_k \le P(\tilde y^k = 1) \le 1 - \lambda_k$, $k \in \{A, S\}$. Table 1 contains the results for other cases.

Under Assumptions 1 and 3, the widths of the identification regions for $P(y = 1)$, under design options A and S, are $d_k = 2\lambda_k$, $k \in \{A, S\}$. Therefore, the losses associated with these two design options are

$$L_A = \gamma + 2\lambda_A, \qquad L_S = \gamma P(\tilde x^S = 1) + 2\lambda_S.$$

Thus, it is optimal to ask all sample members about item $y$ if

$$\gamma + 2\lambda_A \le \min\{1,\; \gamma P(\tilde x^S = 1) + 2\lambda_S\}.$$

Skip sequencing is optimal if

$$\gamma P(\tilde x^S = 1) + 2\lambda_S \le \min\{1,\; \gamma + 2\lambda_A\}.$$

If neither of these inequalities holds, it is optimal not to ask the item at all.
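As a rough numerical sketch of this decision rule under Assumptions 1 and 3, the Python fragment below computes the three losses and picks the smallest. The function names (`region_width`, `losses`, `best_option`) are illustrative and do not come from the paper. The example inputs are the values used for the ADL illustration in Section 5.3, namely $P(\tilde y^A = 1) = P(\tilde y^S = 1) = 0.073$, $P(\tilde x^S = 1) = 0.092$, $\lambda_A = 0.15$ and $\lambda_S = 0.25$. Resolving ties in the order A, S, N is an assumption made here to mirror the weak inequalities above. The width is computed directly from the intersection with $[0, 1]$ in (7) and (9), so the sketch also covers cases in which the bounds are not both informative.

```python
def region_width(p_reported, lam):
    """Width of [0, 1] ∩ [p - lam, p + lam], the region in equations (7) and (9)."""
    lower = max(0.0, p_reported - lam)
    upper = min(1.0, p_reported + lam)
    return upper - lower


def losses(gamma, p_y_A, p_y_S, p_x_S, lam_A, lam_S):
    """Losses L_A, L_S, L_N under Assumptions 1 and 3, with cost weight gamma."""
    return {
        "A": gamma + region_width(p_y_A, lam_A),          # every respondent is asked y
        "S": gamma * p_x_S + region_width(p_y_S, lam_S),  # only those reporting x = 1 are asked y
        "N": 1.0,                                         # item not asked: region has width 1
    }


def best_option(gamma, p_y_A, p_y_S, p_x_S, lam_A, lam_S):
    """Return the loss-minimizing option; ties go to A, then S, then N (an assumption)."""
    loss = losses(gamma, p_y_A, p_y_S, p_x_S, lam_A, lam_S)
    return min(("A", "S", "N"), key=lambda option: loss[option])


if __name__ == "__main__":
    # Inputs taken from the ADL illustration in Section 5.3.
    inputs = dict(p_y_A=0.073, p_y_S=0.073, p_x_S=0.092, lam_A=0.15, lam_S=0.25)
    for gamma in (0.05, 1.0, 8.0):
        print(gamma, best_option(gamma, **inputs), losses(gamma, **inputs))
```

With these inputs the rule selects option A at $\gamma = 0.05$, option S at $\gamma = 1$ and option N at $\gamma = 8$, in line with the thresholds 0.1101 and 7.3587 reported in Section 5.3.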
Under Assumptions 2 and 4, the widths of the identification regions for $P(y = 1)$ are $d_k = \lambda_k / (1 - \lambda_k)$, $k \in \{A, S\}$. Therefore, the losses are

$$L_A = \gamma + \frac{\lambda_A}{1 - \lambda_A}, \qquad L_S = \gamma P(\tilde x^S = 1) + \frac{\lambda_S}{1 - \lambda_S}.$$

Thus, it is optimal to ask all sample members about item $y$ if

$$\gamma + \frac{\lambda_A}{1 - \lambda_A} \le \min\left\{1,\; \gamma P(\tilde x^S = 1) + \frac{\lambda_S}{1 - \lambda_S}\right\}.$$

Skip sequencing is optimal if

$$\gamma P(\tilde x^S = 1) + \frac{\lambda_S}{1 - \lambda_S} \le \min\left\{1,\; \gamma + \frac{\lambda_A}{1 - \lambda_A}\right\}.$$

If neither of these inequalities holds, it is optimal not to ask the item at all.

Table 1. Value of $L_k$ depending on the relationship between $\lambda_k$ and $P(\tilde y^k = 1)$, $k \in \{A, S\}$

| Case | Assumptions 1 and 3 | Assumptions 2 and 4 |
|---|---|---|
| $1 - \lambda_A \le P(\tilde y^A = 1) \le \lambda_A$ | $L_A = \gamma + 1$ | $L_A = \gamma + 1$ |
| $P(\tilde y^A = 1) \le \min\{\lambda_A, 1 - \lambda_A\}$ | $L_A = \gamma + P(\tilde y^A = 1) + \lambda_A$ | $L_A = \gamma + \frac{P(\tilde y^A = 1)}{1 - \lambda_A}$ |
| $\lambda_A \le P(\tilde y^A = 1) \le 1 - \lambda_A$ | $L_A = \gamma + 2\lambda_A$ | $L_A = \gamma + \frac{\lambda_A}{1 - \lambda_A}$ |
| $P(\tilde y^A = 1) \ge \max\{\lambda_A, 1 - \lambda_A\}$ | $L_A = \gamma + 1 - P(\tilde y^A = 1) + \lambda_A$ | $L_A = \gamma + \frac{1 - P(\tilde y^A = 1)}{1 - \lambda_A}$ |
| $1 - \lambda_S \le P(\tilde y^S = 1) \le \lambda_S$ | $L_S = \gamma \delta_x^S + 1$ | $L_S = \gamma \delta_x^S + 1$ |
| $P(\tilde y^S = 1) \le \min\{\lambda_S, 1 - \lambda_S\}$ | $L_S = \gamma \delta_x^S + P(\tilde y^S = 1) + \lambda_S$ | $L_S = \gamma \delta_x^S + \frac{P(\tilde y^S = 1)}{1 - \lambda_S}$ |
| $\lambda_S \le P(\tilde y^S = 1) \le 1 - \lambda_S$ | $L_S = \gamma \delta_x^S + 2\lambda_S$ | $L_S = \gamma \delta_x^S + \frac{\lambda_S}{1 - \lambda_S}$ |
| $P(\tilde y^S = 1) \ge \max\{\lambda_S, 1 - \lambda_S\}$ | $L_S = \gamma \delta_x^S + 1 - P(\tilde y^S = 1) + \lambda_S$ | $L_S = \gamma \delta_x^S + \frac{1 - P(\tilde y^S = 1)}{1 - \lambda_S}$ |

Note. $\delta_x^S \equiv P(\tilde x^S = 1)$.

Determination of the optimal design option requires information on the nature and prevalence of response errors under options A and S. There have been occasional validation and reliability studies documenting the extent of measurement error in survey items; see, for example, Groves (1989) and Bound, Brown and Mathiowetz (2001).
When the literature does not provide credible upper bounds for the probability of data errors, a survey planner may want to perform his own pretest, randomly assigning sample members to options A and S, and then obtain corresponding validation or reliability data. As in Section 4, the size of the pretest sample only needs to be large enough to determine with reasonable confidence which design option is best. It does not need to be large enough to give precise estimates of the upper bounds on the probabilities of data errors.

5.3. Questioning about limitations in ADLs. Consider the questions on limitations in ADLs cited in Illustration 2. The opening question was posed to 2,092 respondents to the 1990 NLSOM, of whom 92.45% were self respondents and 7.55% were proxy respondents. The follow-ups were asked of the 192 persons who responded to the opening question and gave an affirmative answer. We focus here on the first follow-up ADL question: “Now I would like to be more specific. Because of a health or physical problem, do you receive help from another person in bathing or showering?” The nonresponse rate to the opening question was 0.62%. The nonresponse rate to the follow-up question, for the subsample asked this question, was 0.52%. Given these minimal nonresponse rates, we abstract from nonresponse here and concentrate our attention on response error.

To keep this illustration simple, we suppose here that the question on bathing or showering is the only follow-up to the NLSOM opening question on limitations in ADLs. A more realistic analysis would jointly consider the six follow-up questions that actually appear in the survey. This is a straightforward extension of our analysis if one maintains the “marginalist” assumption that the design chosen for the set of ADL items does not affect data quality elsewhere in the survey.
We think this assumption reasonable, because the NLSOM contains only six easily understood questions on limitations in specific ADLs. Item nonresponse to these questions is minimal. Item nonresponse also was minimal when similar questions were asked in the AHEAD survey, described below, which does not use skip sequencing.

We caution that there are circumstances in which skip sequencing avoids having to ask some respondents a long, laborious sequence of irrelevant questions. When this is the case one may, as noted in Section 3.1, think that the skip sequencing decision may materially affect respondents’ willingness or ability to provide reliable responses throughout the survey. When respondent burden is a potential concern, one may find it necessary to move away from simple marginalist analysis of the type we perform and instead treat the design of the entire questionnaire as a complex joint decision problem.

For this illustration, we take the parameter of interest to be the cross-sectional probability $P(y = 1)$ that an individual in the population represented by the NLSOM needs help in bathing/showering. This is one of several parameters of potential interest when studying limitations in ADLs. Connor et al. (2006) emphasize the importance of longitudinal measurement of the duration of disability and of transitions in and out of disability. Concern with these matters might lead one to be interested in $P[y(t) - y(t - k)]$ or $P[y(t) \mid y(t - k)]$, where $y(t)$ and $y(t - k)$ measure limitations in ADLs at two interviews spaced $k$ years apart. It would be of interest to characterize the identification regions for these transition parameters under alternative questionnaire designs.

Consider $P(y = 1)$. The reported probability is $P(\tilde y^S = 1) = 0.073$.
To apply the misclassification model, we need to set values for the upper bounds $\lambda_A$ and $\lambda_S$ on the probability of occurrence of data errors under options A and S. We are not aware of validation studies placing upper bounds on the probability of data errors in self reports of limitations in ADLs for populations similar to the one surveyed by the NLSOM, under design option S. However, there have been studies that compare self reports and proxy reports, as well as some that assess the time series consistency of self reports across interviews. Most of this work analyzes surveys in which the questionnaire uses design option A. See, for example, Rubenstein et al. (1984), Mathiowetz and Groves (1985), Moore (1988), Mathiowetz and Lair (1994), Rodgers and Miller (1997), Mathiowetz and Wunderlich (2000) and Miller and DeMaio (2006). In particular, Rubenstein et al. (1984) and Miller and DeMaio (2006) report the results of reliability studies providing information on the prevalence of data errors.

Rubenstein et al. (1984) analyze two samples of individuals, one providing data on hospitalized elderly persons and the other on nursing home residents. They compare the reports of limitations in ADLs and additional daily activities (such as telephoning, shopping, handling finances, cooking, etc.) of the institutionalized elderly and of a “community proxy” (a spouse, child, or close friend) with those of a nurse proxy. If one assumes that the report of the nurse proxy is always correct, one can conclude from this study that the probability of a data error is bounded above by 0.36. Miller and DeMaio (2006) analyze data on limitations in bathing/showering collected in the 2006 administration of the American Community Survey Content Test. Reliability estimates based on reinterviews suggest a probability of data errors of at most 0.17.

The sampling frame and questionnaire design of the NLSOM differ from the ones analyzed in these reliability studies. Hence, their findings can only be suggestive for our purposes. In what follows we use the bounds in Assumption 5 below. Table 2 collects the results obtained using different values of $\lambda_A$ and $\lambda_S$, which encompass the upper bounds on probabilities of data errors reported by Rubenstein et al. (1984) and Miller and DeMaio (2006).

Assumption 5. $\lambda_A = 0.15$, $\lambda_S = 0.25$.

The identification regions for $P(y = 1)$ under design options A and S are given in Table 1. [The forms given in Section 5.2 do not apply here because the inequalities $\lambda_k \le P(\tilde y^k = 1) \le 1 - \lambda_k$, $k \in \{A, S\}$, do not hold in this application.] Using $\lambda_S = 0.25$ as the upper bound on data errors under design option S, the identification region for $P(y = 1)$ is $H_S[P(y = 1)] = [0, 0.3230]$ under Assumption 3 and $H_S[P(y = 1)] = [0, 0.0973]$ under Assumption 4. The data reveal that $P(\tilde x^S = 1) = 0.092$. Hence, loss is $L_S = 0.092\gamma + 0.3230$ under Assumption 3, and $L_S = 0.092\gamma + 0.0973$ under Assumption 4.

The NLSOM data do not reveal the quantity $P(\tilde y^A = 1)$ needed to determine the identification region for $P(y = 1)$ under design option A. For this illustration, we conjecture that the rate of reported limitations in bathing/showering that would be obtained under option A equals the rate that is observed under option S. Thus, $P(\tilde y^A = 1) = 0.073$. Using $\lambda_A = 0.15$ as the upper bound on data errors under option A, the identification region for $P(y = 1)$ is $H_A[P(y = 1)] = [0, 0.2230]$ under Assumption 1 and $H_A[P(y = 1)] = [0, 0.0859]$ under Assumption 2. Hence, loss is $L_A = \gamma + 0.2230$ under Assumption 1 and $L_A = \gamma + 0.0859$ under Assumption 2.

It follows that it is optimal to ask all sample members about item $y$ if

$$\gamma + 0.2230 \le \min\{1,\; 0.092\gamma + 0.3230\} \iff \gamma \le 0.1101 \quad \text{under Assumptions 1 and 3},$$
$$\gamma + 0.0859 \le \min\{1,\; 0.092\gamma + 0.0973\} \iff \gamma \le 0.0126 \quad \text{under Assumptions 2 and 4}.$$

Skip sequencing is optimal if

$$0.092\gamma + 0.3230 \le \min\{1,\; \gamma + 0.2230\} \iff 0.1101 \le \gamma \le 7.3587 \quad \text{under Assumptions 1 and 3},$$
$$0.092\gamma + 0.0973 \le \min\{1,\; \gamma + 0.0859\} \iff 0.0126 \le \gamma \le 9.8116 \quad \text{under Assumptions 2 and 4}.$$

Otherwise, it is optimal not to ask the item at all.

Table 2. Values of $\gamma$ that determine the choice of a certain design option, depending on $(\lambda_A, \lambda_S)$

| $\lambda_A$ | $\lambda_S$ | Assumptions 1 and 3: Option A is chosen | Assumptions 1 and 3: Option S is chosen | Assumptions 2 and 4: Option A is chosen | Assumptions 2 and 4: Option S is chosen |
|---|---|---|---|---|---|
| 0.100 | 0.100 | Never | $0.000 \le \gamma \le 8.989$ | Never | $0.000 \le \gamma \le 9.988$ |
| | 0.125 | $\gamma \le 0.027$ | $0.027 \le \gamma \le 8.717$ | $\gamma \le 0.003$ | $0.003 \le \gamma \le 9.963$ |
| | 0.170 | $\gamma \le 0.077$ | $0.077 \le \gamma \le 8.228$ | $\gamma \le 0.007$ | $0.007 \le \gamma \le 9.914$ |
| | 0.200 | $\gamma \le 0.110$ | $0.110 \le \gamma \le 7.902$ | $\gamma \le 0.011$ | $0.011 \le \gamma \le 9.878$ |
| | 0.360 | $\gamma \le 0.286$ | $0.286 \le \gamma \le 6.163$ | $\gamma \le 0.036$ | $0.036 \le \gamma \le 9.630$ |
| | 0.400 | $\gamma \le 0.330$ | $0.330 \le \gamma \le 5.728$ | $\gamma \le 0.045$ | $0.045 \le \gamma \le 9.547$ |
| 0.125 | 0.125 | Never | $0.000 \le \gamma \le 8.717$ | Never | $0.000 \le \gamma \le 9.963$ |
| | 0.170 | $\gamma \le 0.050$ | $0.050 \le \gamma \le 8.228$ | $\gamma \le 0.005$ | $0.005 \le \gamma \le 9.914$ |
| | 0.200 | $\gamma \le 0.083$ | $0.083 \le \gamma \le 7.902$ | $\gamma \le 0.008$ | $0.008 \le \gamma \le 9.878$ |
| | 0.360 | $\gamma \le 0.259$ | $0.259 \le \gamma \le 6.163$ | $\gamma \le 0.034$ | $0.034 \le \gamma \le 9.630$ |
| | 0.400 | $\gamma \le 0.303$ | $0.303 \le \gamma \le 5.728$ | $\gamma \le 0.042$ | $0.042 \le \gamma \le 9.547$ |
| 0.170 | 0.170 | Never | $0.000 \le \gamma \le 8.228$ | Never | $0.000 \le \gamma \le 9.914$ |
| | 0.200 | $\gamma \le 0.033$ | $0.033 \le \gamma \le 7.902$ | $\gamma \le 0.004$ | $0.004 \le \gamma \le 9.878$ |
| | 0.360 | $\gamma \le 0.209$ | $0.209 \le \gamma \le 6.163$ | $\gamma \le 0.029$ | $0.029 \le \gamma \le 9.630$ |
| | 0.400 | $\gamma \le 0.253$ | $0.253 \le \gamma \le 5.728$ | $\gamma \le 0.037$ | $0.037 \le \gamma \le 9.547$ |
| 0.200 | 0.200 | Never | $0.000 \le \gamma \le 7.902$ | Never | $0.000 \le \gamma \le 9.878$ |
| | 0.360 | $\gamma \le 0.176$ | $0.176 \le \gamma \le 6.163$ | $\gamma \le 0.025$ | $0.025 \le \gamma \le 9.630$ |
| | 0.400 | $\gamma \le 0.220$ | $0.220 \le \gamma \le 5.728$ | $\gamma \le 0.033$ | $0.033 \le \gamma \le 9.547$ |
| 0.360 | 0.360 | Never | $0.000 \le \gamma \le 6.163$ | Never | $0.000 \le \gamma \le 9.630$ |
| | 0.400 | $\gamma \le 0.044$ | $0.044 \le \gamma \le 5.728$ | $\gamma \le 0.008$ | $0.008 \le \gamma \le 9.547$ |
| 0.400 | 0.400 | Never | $0.000 \le \gamma \le 5.728$ | Never | $0.000 \le \gamma \le 9.547$ |

We conclude this section by calling attention to the fact that the 1993 wave of the Assets and Health Dynamics Among the Oldest Old (AHEAD) survey targeted a population similar in age to the NLSOM. The AHEAD survey also asked respondents about their limitations in ADLs, but it used neither design option A nor S. Instead, AHEAD omitted the opening broad question of the NLSOM and immediately posed a series of specific questions to all respondents. The fraction of AHEAD respondents who reported limitations in bathing/showering was 0.085, a value close to that elicited in the NLSOM. To compare the AHEAD and NLSOM designs would require generalization of the decision problem that we set up in Section 3. In particular, we would need to take into account the loss of information on limitations in ADLs that may potentially occur in AHEAD by dropping the opening question.

6. Conclusion. Survey planners have long had to cope with the tension between the desire to reduce the costs and increase the informativeness of surveys. However, they have not studied questionnaire design as a formal decision problem in which one uses an explicit loss function to quantify the trade-off between cost and informativeness.
Groves (1987) called attention to this in an article in Public Opinion Quarterly (POQ), writing (page S167):

“The inextricable link between costs and errors rarely is formally acknowledged in methods articles in POQ, or in any other scholarly journal for that matter. That state of affairs has two detrimental effects: (1) methodologists invent methods to reduce an error, but fail to measure the cost impact of the new idea, and (2) practitioners reject new ideas until it becomes clear that they result in reduced costs. Given the link between errors and costs, many new ideas require spending money to reduce an error.”

Groves went on to contrast the situation in questionnaire design with that in survey sampling, which has long used formal models of cost and sampling error to analyze the problem of choosing sample size. See also Spencer (1980, 1985, 1994), who has argued broadly for benefit–cost analysis of programs of data collection, with particular attention to the U.S. Census.

This paper has formally analyzed skip sequencing as a decision problem in questionnaire design. We have intentionally kept the exposition simple in order to highlight the basic trade-off between cost and informativeness in choosing a design option. Survey researchers and statisticians with traditional training may be least familiar with our measurement of informativeness by the size of the identification region for a population parameter of interest. Although identification is the central problem generated by nonresponse and response errors, the research literatures in survey research and statistics contain remarkably little formal analysis of identification. We think that the illustrative cases considered in Sections 4 and 5 give a constructive sense of how to proceed, without getting bogged down in mathematical detail.
While identification is the dominant issue in assessing data quality in large surveys, sampling error can also be a significant concern in smaller surveys. A straightforward extension of our work to smaller surveys is to measure informativeness through a confidence interval for the partially identified parameter of interest. The literature on partial identification has recently spawned many approaches to the construction of asymptotically valid confidence intervals. See, for example, Imbens and Manski (2004), Chernozhukov, Hong and Tamer (2007) and Beresteanu and Molinari (2008). Another approach, with a firmer decision-theoretic foundation, would be to address the questionnaire design problem from the perspective of Wald (1950).

APPENDIX: MIXTURE MODEL AND MISCLASSIFICATION MODEL

The mixture model of robust statistics introduces latent variables $e \in Y$ and $w \in \{0, 1\}$, and views the reported values $\tilde y$ as generated by the mixture $\tilde y = w y + (1 - w) e$. The unobservable binary variable $w$ denotes whether $y$ or $e$ is observed. Realizations of $\tilde y$ with $w = 1$ are said to be error free and those with $w = 0$ are said to be data errors. By the Law of Total Probability, the relationship between the observable distribution $P(\tilde y)$ and the unobservable distribution $P(y)$ is

$$P(\tilde y) = P(y \mid w = 1) P(w = 1) + P(e \mid w = 0) P(w = 0), \tag{11}$$
$$P(y) = P(y \mid w = 1) P(w = 1) + P(y \mid w = 0) P(w = 0). \tag{12}$$

The mixture model per se is a formalism without content. It becomes informative when accompanied by the assumption of an upper bound on the occurrence of data errors, as follows:

Assumption A.1. $P(w = 0) \le \lambda < 1$.

It is sometimes also assumed that the occurrence of errors is statistically independent of the value of $y$. That is,

Assumption A.2. $y \perp w$.
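A small numerical check may help fix ideas. Subtracting (12) from (11) gives, in the binary case, $P(\tilde y = 1) - P(y = 1) = P(w = 0)\,[P(e = 1 \mid w = 0) - P(y = 1 \mid w = 0)]$, so the reported and true probabilities differ by at most $P(w = 0) \le \lambda$. The Python sketch below enumerates a hypothetical joint distribution of $(y, w, e)$ and confirms this. The function name, the made-up probabilities, and the simplifying assumption that $y$, $w$ and $e$ are mutually independent (which implies Assumption A.2) are illustrative choices, not taken from the paper.

```python
from itertools import product


def mixture_summary(p_y1, p_w0, p_e1):
    """Enumerate independent binary (y, w, e) and return
    (P(ytilde = 1), P(y = 1), P(w = 0)) for ytilde = w*y + (1 - w)*e."""
    p_ytilde1 = p_true1 = p_error = 0.0
    for y, w, e in product((0, 1), repeat=3):
        # Joint probability of this cell under the independence assumption.
        prob = (
            (p_y1 if y else 1 - p_y1)
            * (1 - p_w0 if w else p_w0)
            * (p_e1 if e else 1 - p_e1)
        )
        ytilde = w * y + (1 - w) * e  # reported value from the mixture
        p_ytilde1 += prob * ytilde
        p_true1 += prob * y
        p_error += prob * (1 - w)
    return p_ytilde1, p_true1, p_error


if __name__ == "__main__":
    # Hypothetical inputs: P(y = 1) = 0.3, P(w = 0) = 0.1, and e = 1 whenever an error occurs.
    reported, true, error_rate = mixture_summary(p_y1=0.3, p_w0=0.1, p_e1=1.0)
    print(reported, true, error_rate, abs(reported - true) <= error_rate)
```

In this example the gap between reported and true probabilities is $0.1 \times (1 - 0.3) = 0.07$, which stays below the error rate of 0.1.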
Horowitz and Manski (1995) studied the implications of the mixture model for partial identification of probability distributions; see also Manski (2003), Chapter 4. They derived the identification region for $P(y)$ and for parameters of this distribution that respect stochastic dominance, under Assumption A.1 alone and under Assumptions A.1 and A.2. They refer to the first case as “corrupted sampling,” and to the second as “contaminated sampling.”

The relationship between the mixture model and the misclassification model can be easily established starting from equation (11). Observe that

$$P(\tilde y = j \mid y = k) = \begin{cases} P(w = 1 \mid y = k) + P(e = k \mid y = k, w = 0)\, P(w = 0 \mid y = k), & \text{if } j = k, \\ P(e = j \mid y = k, w = 0)\, P(w = 0 \mid y = k), & \text{if } j \ne k. \end{cases} \tag{13}$$

Hence, assumptions on $P(w \mid y)$ translate immediately into assumptions for the misclassification model. Molinari (2003) shows that if the distribution of $e$ is unrestricted, the mixture model with Assumptions A.1 and A.2 is equivalent to the misclassification model with an assumption specifying a common lower bound on the probabilities of correct report, $P(\tilde y = k \mid y = k)$, $k \in Y$. The mixture model with Assumption A.1 alone is equivalent to the misclassification model with an assumption specifying a lower bound on the probability that $\tilde y$ and $y$ coincide, $P(\tilde y = y)$.

Acknowledgments. We thank four anonymous reviewers and the Editor for comments.

REFERENCES

Beresteanu, A. and Molinari, F. (2008). Asymptotic properties for a class of partially identified models. Econometrica. To appear.

Blundell, R., Gosling, A., Ichimura, H. and Meghir, C. (2007). Changes in the distribution of male and female wages accounting for employment composition using bounds. Econometrica 75 323–363.

Bound, J., Brown, C. and Mathiowetz, N. A. (2001). Measurement error in survey data. In Handbook of Econometrics 5 Chapter 59 (J. Heckman and E. Leamer, eds.) 3705–3843. North-Holland, Amsterdam.

Chernozhukov, V., Hong, H. and Tamer, E. (2007). Estimation and confidence regions for parameter sets in econometric models. Econometrica 75 1243–1284.

Connor, J. T., Fienberg, S. E., Erashova, A. E. and White, T. (2006). Towards a restructuring of the national long term care survey: A longitudinal perspective. Prepared for presentation at an Expert Panel Meeting on the National Long Term Care Survey, Committee on National Statistics, National Research Council.

Groves, R. M. (1987). Research on survey data quality. The Public Opinion Quarterly 51 S156–S172.

Groves, R. M. (1989). Survey Errors and Survey Costs. Wiley, New York.

Groves, R. M. and Heeringa, S. G. (2006). Responsive design for household surveys: Tools for actively controlling survey errors and costs. J. Roy. Statist. Soc. Ser. A 169 437–457. MR2236915

Hill, D. H. (1991). Interviewer, respondent, and regional office effects on response variance: A statistical decomposition. In Measurement Errors in Surveys (P. Biemer, R. Groves, L. Lyberg, N. Mathiowetz and S. Sudman, eds.) 463–483. Wiley, New York.

Hill, D. H. (1993). Response and sequencing errors in surveys: A discrete contagious regression analysis. J. Amer. Statist. Assoc. 88 775–781.

Huber, P. (1964). Robust estimation of a location parameter. Ann. Math. Statist. 35 73–101. MR0161415

Horowitz, J. L. and Manski, C. F. (1995). Identification and robustness with contaminated and corrupted data. Econometrica 63 281–302. MR1323524

Imbens, G. and Manski, C. F. (2004). Confidence intervals for partially identified parameters. Econometrica 72 1845–1857. MR2095534

Krosnick, J. (1999). Survey research. Ann. Rev. Psychology 50 537–567.

Little, R. J. and Rubin, D. B. (1987). Statistical Analysis with Missing Data. Wiley, New York. MR0890519

Manski, C. F. (1989). Anatomy of the selection problem. J. Human Resources 24 343–360.

Manski, C. F. (1994). The selection problem. In Advances in Econometrics, Sixth World Congress I (C. Sims, ed.) 143–170. Cambridge Univ. Press. MR1278269

Manski, C. F. (2003). Partial Identification of Probability Distributions. Springer, New York. MR2151380

Mathiowetz, N. A. and Groves, R. M. (1985). The effects of respondent rules on health survey reports. American J. Public Health 75 639–644.

Mathiowetz, N. A. and Lair, T. J. (1994). Getting better? Change or error in the measurement of functional limitations. J. Economic and Social Measurement 20 237–262.

Mathiowetz, N. A. and Wunderlich, G. S., eds. (2000). Survey Measurement of Work Disability. National Academy Press, Washington, DC.

Messmer, D. and Seymour, D. (1982). The effect of branching on item nonresponse. Public Opinion Quarterly 46 270–277.

Miller, K. and DeMaio, T. J. (2006). Report of cognitive research on proposed American community survey disability questions. U.S. Census Bureau, Statistical Research Division Report #SSM2006/06.

Molinari, F. (2003). Contaminated, corrupted, and missing data. Ph.D. thesis, Northwestern Univ. Available at http://www.arts.cornell.edu/econ/fmolinari/dissertation.pdf.

Molinari, F. (2008). Partial identification of probability distributions with misclassified data. J. Econometrics. To appear.

Moore, J. C. (1988). Self/proxy response status and survey response quality. J. Official Statistics 4 155–172.

Rodgers, W. L. and Miller, B. (1997). A comparative analysis of ADL questions in surveys of older people. J. Gerontology Ser. B 52B (Special Issue) 21–36.

Rubenstein, L. Z., Schairer, C., Wieland, G. D. and Kane, R. (1984). Systematic biases in functional status assessment of elderly adults: Effects of different data sources. J. Gerontology 39 686–691.

Spencer, B. D. (1980). Benefit–Cost Analysis of Data Used to Allocate Funds. Springer, New York. MR0581534

Spencer, B. D. (1985). Optimal data quality. J. Amer. Statist. Assoc. 80 564–573. MR0803257

Spencer, B. D. (1994). Sensitivity of benefit–cost analysis of data programs to monotone misspecification. J. Statist. Plann. Inference 39 19–31. MR1266989

Stoye, J. (2005). Partial identification of spread parameters when some data are missing. Dept. Economics, New York Univ.

Wald, A. (1950). Statistical Decision Functions. Wiley, New York. MR0036976

Department of Economics and Institute for Policy Research
Northwestern University
2001 Sheridan Road
Evanston, Illinois 60208-2600
USA
E-mail: cfmanski@northwestern.edu

Department of Economics
Cornell University
Uris Hall
Ithaca, New York 14853-7601
USA
E-mail: fm72@cornell.edu
