A smoothing model for sample disclosure risk estimation

IMS Lecture Notes–Monograph Series
Complex Datasets and Inverse Problems: Tomography, Networks and Beyond
Vol. 54 (2007) 161–171
© Institute of Mathematical Statistics, 2007
DOI: 10.1214/074921707000000120

Yosef Rinott¹,* and Natalie Shlomo²,†
Hebrew University and Southampton University

Abstract: When a sample frequency table is published, disclosure risk arises when some individuals can be identified on the basis of their values in certain attributes in the table called key variables, and then their values in other attributes may be inferred, and their privacy is violated. On the basis of the sample to be released, and possibly some partial knowledge of the whole population, an agency which considers releasing the sample has to estimate the disclosure risk.

Risk arises from non-empty sample cells which represent small population cells, and from population uniques in particular. Therefore risk estimation requires assessing how many of the relevant population cells are likely to be small. Various methods have been proposed for this task, and we present a method in which estimation of a population cell frequency is based on smoothing using a local neighborhood of this cell, that is, cells having similar or close values in all attributes.

We provide some preliminary results and experiments with this method. Comparisons are made to two other methods:

1. A log-linear models approach in which inference on a given cell is based on a "neighborhood" of cells determined by the log-linear model. Such neighborhoods have one or some common attributes with the cell in question, but some other attributes may differ significantly.
2. The Argus method, in which inference on a given cell is based only on the sample frequency in the specific cell, on the sample design and on some known marginal distributions of the population, without learning from any type of "neighborhood" of the given cell, nor from any model which uses the structure of the table.

* Research supported by the Israel Science Foundation (grant No. 473/04).
† Research supported in part by the Israel Science Foundation (grant No. 473/04).
¹ Department of Statistics, Hebrew University, Jerusalem, Israel, e-mail: rinott@huji.ac.il
² Department of Statistics, Hebrew University of Jerusalem; Southampton Statistical Sciences Research Institute, University of Southampton, United Kingdom, e-mail: N.Shlomo@soton.ac.uk
AMS 2000 subject classifications: primary 62H17; secondary 62-07.
Keywords and phrases: sample uniques, neighborhoods, microdata.

1. Introduction

When a microdata sample file is released by an agency, directly identifying variables, such as name, address, etc., are always deleted, variable values are often grouped (e.g., Age-Groups instead of precise age), and the data is given in the form of a frequency table. However, disclosure risk may still exist, that is, some individuals in the file may be identified by their combination of values in the variables appearing in the data. Samples often contain information on certain variables on which the agency's information for the whole population is limited, such as expenditure on specific items in a Household Expenditure Survey, or detailed information on variables such as children's extracurricular activities in the Social Survey of the Israel Central Bureau of Statistics.

Often agencies have to assess the disclosure risk involved in the release of sample data in the form of a frequency table when the corresponding population table may be unknown, or only partially known. Risk arises from cells in which both sample and population frequencies are small, allowing an intruder who has the sample data and access to some information on the population, and in particular on individuals of interest, to identify such individuals in the sample with high probability. Thus, the disclosure risk depends both on the given sample and on the population. In this paper we are concerned with the issue of estimating the disclosure risk involved in releasing a sample on the basis of the sample alone, assuming the population is unknown.

Let $f = \{f_k\}$ denote an $m$-way frequency table, which is a sample from a population table $F = \{F_k\}$, where $k = (k_1, \ldots, k_m)$ indicates a cell and $f_k$ and $F_k$ denote the frequency in the sample and population cell $k$, respectively. Formally, the sample and population sizes in our models are random and their expectations are denoted by $n$ and $N$ respectively, and the number of cells by $K$. We can either assume that $n$ and $N$ are known, or that they are estimated by their natural estimators: the actual sample and population sizes, assumed to be known. In the sequel when we write $n$ or $N$ we formally refer to expectations.

If the $m$ attributes in the table can be considered key variables, that is, variables which are to some extent accessible to the public or to potential intruders, then disclosure risk arises from cells in which both $f_k$ and $F_k$ are positive and small, and in particular when $f_k = F_k = 1$ (sample and population uniques). Suppose an intruder locates a sample unique in cell $k$, say, and is aware of the fact that the combination of values $k = (k_1, \ldots, k_m)$ happens to be unique or rare in the population. If this combination matches an individual of interest to the intruder then identification can be made with high probability on the basis of the $m$ attributes. If the sample contains information on the values of other attributes, then these can now be inferred for the individual in question, and his privacy is violated. In many countries this would constitute a violation of law. For example, the Central Bureau of Statistics in Israel operates under the Statistics Ordinance (1972) which says "No information... shall be so [published] as to enable the identification of the person to whom it relates".

A global risk measure quantifies an aspect of the total risk in the file by aggregating risk over the individual cells. For simplicity we shall focus here only on two global measures, which are based on sample uniques:

$$\tau_1 = \sum_k I(f_k = 1,\, F_k = 1), \qquad \tau_2 = \sum_k I(f_k = 1)\,\frac{1}{F_k},$$

where $I$ denotes the indicator function. Note that $\tau_1$ counts the number of sample uniques which are also population uniques, and $\tau_2$ is the expected number of correct guesses if each sample unique is matched to a randomly chosen individual from the same population cell. These measures are somewhat arbitrary, and one could consider measures which reflect matching of individuals that are not sample uniques, possibly with some restrictions on cell sizes. Also, it may make sense to normalize these measures by some measure of the total size of the table, by the number of sample uniques, or by some measure of the information value of the data. Various individual and global risk measures have been proposed in the literature, see e.g., Benedetti et al. [1, 2], Skinner and Holmes [12], Elamir and Skinner [6], Rinott [8].
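When both the sample table and the population table are available, as in an evaluation experiment, the two global measures defined above can be computed directly. The following is a minimal sketch of that computation; the function name `global_risk`, the dictionary representation of the tables and the four-cell toy data are illustrative assumptions, not part of the paper.

```python
def global_risk(sample, population):
    """Return (tau1, tau2) computed over the sample uniques.

    sample, population: dicts mapping a cell key k to f_k and F_k.
    """
    tau1 = 0      # sample uniques that are also population uniques
    tau2 = 0.0    # expected number of correct matches, sum of 1/F_k
    for cell, f_k in sample.items():
        if f_k != 1:
            continue              # only sample uniques contribute
        F_k = population[cell]    # F_k >= f_k = 1, so no division by zero
        if F_k == 1:
            tau1 += 1
        tau2 += 1.0 / F_k
    return tau1, tau2


# Hypothetical two-way table with four cells (illustrative numbers only).
population = {("young", "low"): 1, ("young", "high"): 4,
              ("old", "low"): 2, ("old", "high"): 10}
sample = {("young", "low"): 1, ("young", "high"): 1,
          ("old", "low"): 0, ("old", "high"): 3}

tau1, tau2 = global_risk(sample, population)
print(tau1, tau2)  # 1 1.25
```

In this toy table there are two sample uniques; one of them is also a population unique, and the other comes from a population cell of size 4, which contributes 1/4 to the second measure.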
In Section 3 we propose and explain a new method of estimation of quantities like $\tau_1$ and $\tau_2$, using a standard Poisson model and local smoothing of frequency tables. The method is based on the idea that one can learn about a given population cell from neighboring cells, if a suitable definition of closeness is possible, without relying on complex modeling. In Sections 2.1 and 2.2 we briefly describe two known methods of estimation of quantities like $\tau_1$ and $\tau_2$, and in Section 4 we provide real data experiments which compare the methods discussed.

We consider the case that $f$ is known, and $F$ is an unknown parameter (on which there may be some partial information) and the quantities $\tau_1$ and $\tau_2$ should be estimated. Note that they are not proper parameters, since they involve both the sample $f$ and the parameter $F$.

The methods discussed in this paper consist of modeling the conditional distribution of $F \mid f$, estimating parameters in this distribution and then using estimates of the form

$$\hat{\tau}_1 = \sum_k I(f_k = 1)\,\hat{P}(F_k = 1 \mid f_k = 1), \qquad \hat{\tau}_2 = \sum_k I(f_k = 1)\,\hat{E}\!\left[\frac{1}{F_k} \,\Big|\, f_k = 1\right], \tag{1}$$

where $\hat{P}$ and $\hat{E}$ denote estimates of the relevant conditional probability and expectation. For a general theory of estimates of this type see Zhang [14] and references therein. Some direct variance estimates appear in Rinott [8].

2. Models

For completeness we briefly introduce the Poisson and Negative Binomial models. More details can be found, for example, in Bethlehem et al. [3], Cameron and Trivedi [4], Rinott [8].

A common assumption in the frequency table literature is $F_k \sim \mathrm{Poisson}(N\gamma_k)$, independently, where $N$ is assumed to be a known parameter, and $\sum_k \gamma_k = 1$.
Binomial (or Poisson) sampling from $F_k$ means that $f_k \mid F_k \sim \mathrm{Bin}(F_k, \pi_k)$, where each $\pi_k$ is a known constant which is part of the sampling design, called the sampling fraction in cell $k$. By standard calculations we then have

$$f_k \sim \mathrm{Poisson}(N\gamma_k\pi_k) \quad\text{and}\quad F_k \mid f_k \sim f_k + \mathrm{Poisson}\big(N\gamma_k(1-\pi_k)\big), \tag{2}$$

leading to the Poisson model of Section 2.1 below. Under this model the population size is random with expectation $N$, and so is the sample size, with expectation $N\sum_k \gamma_k\pi_k$ which we denote by $n$. In practice we have in mind that $N$ and $n$ could be estimated by the actual population and sample sizes, and these estimates could be "plugged in" where needed.

If one adds the Bayesian assumption $\gamma_k \sim \mathrm{Gamma}(\alpha, \beta)$ independently, with $\alpha\beta = 1/K$ to ensure that $E\sum_k \gamma_k = 1$, then $f_k \sim NB\big(\alpha,\, p_k = \frac{1}{1+N\pi_k\beta}\big)$, the Negative Binomial distribution defined for any $\alpha > 0$ by

$$P(f_k = x) = \frac{\Gamma(x+\alpha)}{\Gamma(x+1)\,\Gamma(\alpha)}\,(1-p_k)^x\, p_k^{\alpha}, \qquad x = 0, 1, 2, \ldots,$$

which for a natural number $\alpha$ counts the number of failures until $\alpha$ successes occur in independent Bernoulli trials with probability of success $p_k$. Further calculations yield $F_k \mid f_k \sim f_k + NB\big(\alpha + f_k,\, \frac{N\pi_k + 1/\beta}{N + 1/\beta}\big)$, $(F_k \ge f_k)$. Note that in this model the population size is again random with expectation $N$, and now the sample size has expectation $N\sum_k \pi_k / K$ which we denote again by $n$.

As $\alpha \to 0$ (and hence $\beta \to \infty$) we obtain $F_k \mid f_k \sim f_k + NB(f_k, \pi_k)$, which is exactly the Negative Binomial assumption in Section 2.2 below. As $\alpha \to \infty$ the Poisson model of Section 2.1 is obtained, and in this sense the Negative Binomial with parameter $\alpha$ subsumes both models.

Next we discuss two methods which have received much attention. They have been applied in some bureaus of statistics recently, and are being tested by others.

2.1. The Poisson log-linear method

Skinner and Holmes [12] and Elamir and Skinner [6] proposed and studied the following approach. Assuming a fixed sampling fraction, that is, $\pi_k = \pi$, the first part of (2) implies $f_k \sim \mathrm{Poisson}(n\gamma_k)$, where $n = N\pi$. Using the sample $\{f_k\}$ one can fit a log-linear model using standard programs, and obtain estimates $\{\hat{\gamma}_k\}$ of the parameters. Goodness of fit measures for selecting models having good risk estimates were studied in Skinner and Shlomo [11]. Using the second part of (2) it is easy to compute individual risk measures for cell $k$, defined by

$$P(F_k = 1 \mid f_k = 1) = e^{-N\gamma_k(1-\pi_k)}, \qquad E\!\left[\frac{1}{F_k} \,\Big|\, f_k = 1\right] = \frac{1}{N\gamma_k(1-\pi_k)}\left[1 - e^{-N\gamma_k(1-\pi_k)}\right]. \tag{3}$$

Plugging $\hat{\gamma}_k$ for $\gamma_k$ in (3) leads to the desired estimates $\hat{P}(F_k = 1 \mid f_k = 1)$ and $\hat{E}[\frac{1}{F_k} \mid f_k = 1]$ and then to $\hat{\tau}_1$ and $\hat{\tau}_2$ of (1).

For each $k$ we therefore obtain estimates of $P(F_k = 1 \mid f_k = 1)$ and $E[\frac{1}{F_k} \mid f_k = 1]$ which depend on $\hat{\gamma}_k$, which in turn depends on the frequencies in other cells. For example, in a log-linear model of independence, $\hat{\gamma}_k$ depends on the frequencies in all cells which have a common attribute with $k$. Thus cells that are rather different in nature, having values which are very different from those of cell $k$ in most of the attributes, influence the estimates of the parameter $\gamma_k$ pertaining to this cell. The main goal of this paper is to study the possibility of estimating $\gamma_k$ using cells in more local "neighborhoods," having attribute values which are closer to those of the cell $k$ in cases where closeness can be defined.

2.2. The Argus method

This method, proposed by Benedetti et al. [1, 2], was originally oriented towards individual risk estimation, but was subsequently also applied to global risk measures, see, e.g., Polettini and Seri [7], and Rinott [8].
Argus has recently been implemented in some European statistical bureaus.

In the Argus model it is assumed that $F_k \mid f_k \sim f_k + NB(f_k, \pi_k)$ with an implicit assumption of independence between cells. Since the $\pi_k$ are assumed known we could now calculate $P_{\pi_k}(F_k = 1 \mid f_k = 1)$ and $E_{\pi_k}[\frac{1}{F_k} \mid f_k = 1]$. However, because of non-response, sampling biases and errors, Argus does not use the known $\pi_k$, but rather estimates them from the sampling weights as discussed next.

At statistics bureaus, each statistical unit responding to a sample survey is assigned a sampling weight. This weight $w_i$ is an inflating factor that informs on the number of units in the population that are represented by sample unit $i$, to be used for inference from the sample to the population. It is calculated as the inverse sampling fraction, adjusted for non-response or other biases that may occur in the sampling process. These adjustments are often carried out within post-strata (weighting classes) defined by known marginal distributions of the population, such as Age, Sex and Geographical Location. The inverse sampling fractions are calibrated so that the weighted sample count in each post-stratum is equal to the known population total; this calibration reduces under- or over-representation of the chosen strata due to any bias or sampling errors.

The Argus method provides initial estimates of the population cell sizes of the form $\hat{F}_k = \sum_{i \in \text{cell } k} w_i$, where $w_i$ denotes the sampling weight of individual $i$ described above (see also example below).
Here is a simple example: Suppose for simplicity that the sampling weights are based only on the sampling design, and on post-stratification by a single variable, say Sex, and that the sample is designed to be a random subset consisting of one percent of the population, and therefore we have the same sampling fraction of $\pi = 1/100$ in each Sex group. If males, say, have a non-response rate of 20%, and females of 0%, then the sampling weight for women in the sample would be $w_i = 100$, and for men $w_i = 100/0.8 = 125$. If in the sample table there is a cell $k = (k_1, k_2)$ where $k_1$ stands for Male, and $k_2$ stands for the level in another attribute, such as Income, and $f_k = 20$, then in this cell all $w_i$ are 125, and $\hat{F}_k = 20 \cdot 125 = 2500$.

Now suppose Sex is not one of the variables in the table to be released, but the agency knows it for all individuals in the sample. Suppose the variables in the table are Income and Occupation, and suppose now $k = (k_1, k_2)$, where $k_1$ stands for a given Income group, and $k_2$ for a given Occupation. Suppose $f_k = 20$, meaning that in the sample there are 20 individuals with the given income group and occupation, and suppose that there are 10 males and 10 females in this group. The weight is $w_i = 100$ for the 10 females, and 125 for the 10 males, and therefore $\hat{F}_k = 10 \cdot 100 + 10 \cdot 125 = 2250$.

In the above example sampling weights reflect non-response. In principle a bureau may arrive at such weights also because in the original sampling design men are under-represented, or because it finds out that this is the case after post-stratifying on Sex and observing that males are under-represented for some reason (some bias, including non-response, or sampling error).

Returning to Argus, recall its initial estimates of the population cell sizes $\hat{F}_k = \sum_{i \in \text{cell } k} w_i$.
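The arithmetic of the weighting example can be checked with a few lines of code. This is only an illustrative sketch: the helper names `sampling_weight` and `argus_initial_estimate` are hypothetical, and the weight is taken to be the inverse of the sampling fraction times the response rate, which is the only adjustment the example considers. Exact fractions are used to avoid rounding noise.

```python
from fractions import Fraction


def sampling_weight(sampling_fraction, response_rate):
    """Inverse sampling fraction, inflated for non-response."""
    return 1 / (sampling_fraction * response_rate)


def argus_initial_estimate(weights):
    """Initial population cell size estimate: sum of the weights in the cell."""
    return sum(weights)


PI = Fraction(1, 100)                            # one percent sample
w_female = sampling_weight(PI, Fraction(1))      # no non-response -> 100
w_male = sampling_weight(PI, Fraction(4, 5))     # 20% non-response -> 125

# Cell of 20 males (Sex is one of the table attributes).
F_hat_males = argus_initial_estimate([w_male] * 20)

# Cell of 10 males and 10 females (Sex is not a table attribute).
F_hat_mixed = argus_initial_estimate([w_male] * 10 + [w_female] * 10)

print(w_female, w_male, F_hat_males, F_hat_mixed)  # 100 125 2500 2250
```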
Using the relation $E_{\pi_k}[F_k \mid f_k] = f_k/\pi_k$, the parameters $\pi_k$ are estimated using the moment-type estimate $\hat{\pi}_k = f_k/\hat{F}_k$. Note that if $F_k$ were known, this would be the usual estimate of the binomial sampling probability. Straightforward calculations with the Negative Binomial distribution show

$$P_{\hat{\pi}_k}(F_k = 1 \mid f_k = 1) = \hat{\pi}_k \quad\text{and}\quad E_{\hat{\pi}_k}\!\left[\frac{1}{F_k} \,\Big|\, f_k = 1\right] = -\frac{\hat{\pi}_k}{1-\hat{\pi}_k}\,\log(\hat{\pi}_k).$$

Plugging these estimates for $\hat{P}$ and $\hat{E}$ in (1) we obtain the estimates $\hat{\tau}_1$ and $\hat{\tau}_2$ of the global risk measures.

Note that in this method the cells are treated completely independently, one cell at a time, and the structure of the table, or relations between different cells, play no role. Moreover, since this method does not involve a model which reduces the number of parameters, it is required to estimate essentially $K$ parameters, which is typically hard in sparse tables of the kind we have in mind.

3. Smoothing polynomials and local neighborhoods

The estimation question here is essentially the following: given, say, a sample unique, how likely is it to be also a population unique, or to arise from a small population cell? If a sample unique is found in a part of the sample table where neighboring cells (by some reasonable metric, to be discussed later) are small or empty, then it seems reasonable to believe that it is more likely to have arisen from a small population cell. This motivates our attempt to study local neighborhoods, and to compare the results to the type of model-driven neighborhood used by the log-linear method, and to the Argus method which uses no neighborhoods.
Consider frequency tables in which some of the attributes are ordinal, and define closeness between categories of an attribute in terms of the order, or more generally, suppose that for a certain attribute one can say that some values of the attribute are closer to a given value than others. For example, Age and Years of Education are ordinal attributes, and naturally the age of 5 is closer to 6 than to 7 or 17, say, while Occupation is not ordinal, but one can try to define reasonable notions of closeness between different occupations.

Classical log-linear models do not take such closeness into account, and therefore, when such models are used for individual cell parameter estimation, the estimates involve data in cells which may be rather remote from the estimated cell. On the other hand, as mentioned above, the Argus method bases its estimation only on the sampling weight of the estimated population cell. There is no learning from other cells, the structure of the table plays no role, and each cell's parameter is estimated separately.

We now describe our proposed approach which consists of using local neighborhoods of the estimated cell.

Returning to (2) we assume that $f_k \sim \mathrm{Poisson}(\lambda_k = N\gamma_k\pi_k)$. Apart from constants, the sample log-likelihood is $\sum_{k=1}^{K} [f_k \log\lambda_k - \lambda_k]$. However, if we use a model for $\lambda_k$ which is valid only in some neighborhood $M$ of a given cell, we shall consider the log-likelihood of the data in this neighborhood, that is

$$\sum_{k \in M} [f_k \log\lambda_k - \lambda_k]. \tag{4}$$

For convenience of notation we now assume that $m = 2$, that is, we consider two-way tables; the extension to any $m$ is straightforward. Following Simonoff [10], see also references therein, we use a local smoothing polynomial model. For each fixed $k = (k_1, k_2)$ separately, we write the model below for $\lambda_{k'}$ in terms of the parameters $\alpha = (\beta_0, \beta_1, \gamma_1, \ldots, \beta_t, \gamma_t)$, with $k' = (k'_1, k'_2)$ varying in some neighborhood of $k$:

$$\log\lambda_{k'}(\alpha) \equiv \log\lambda_{(k'_1, k'_2)} = \beta_0 + \beta_1(k'_1 - k_1) + \gamma_1(k'_2 - k_2) + \cdots + \beta_t(k'_1 - k_1)^t + \gamma_t(k'_2 - k_2)^t, \tag{5}$$

for some natural number $t$. One can hope that such a polynomial model is valid with a suitable $t$ for $k' = (k'_1, k'_2)$ in some neighborhood $M$ of $k = (k_1, k_2)$. Substituting (5) into (4) we maximize the concave function

$$L(\alpha) = L(\beta_0, \beta_1, \gamma_1, \ldots, \beta_t, \gamma_t) = \sum_{(k'_1, k'_2) \in M} \left[f_{(k'_1, k'_2)} \log\lambda_{(k'_1, k'_2)} - \lambda_{(k'_1, k'_2)}\right] \tag{6}$$

with respect to the coefficients in $\alpha$ of the regression model (5). With $\arg\max L(\alpha) = \hat{\alpha}$, and $\hat{\beta}_0$ denoting its first component, we finally obtain our estimate of $\lambda_k = \lambda_{(k_1, k_2)}$ in the form

$$\hat{\lambda}_k \equiv \lambda_k(\hat{\alpha}) = \exp(\hat{\beta}_0), \tag{7}$$

where the second equality is explained by taking $k' = k = (k_1, k_2)$ in (5). The maximization by the Newton-Raphson method is rather straightforward and fast.

Each of the estimates $\hat{\lambda}_k$ requires a separate maximization as above, which leads to a value $\hat{\alpha}$ that depends on $k = (k_1, k_2)$, and a set of estimates $\lambda_{k'}(\hat{\alpha})$, of which only $\hat{\lambda}_k$ of (7) is used. For the risk measures discussed in this paper, it suffices to compute these estimates for cells $k$ which are sample uniques, that is, $f_k = 1$.

Equating the partial derivative of the function of (6) with respect to $\beta_0$ to zero we obtain $\sum_{k' \in M} \lambda_{k'}(\hat{\alpha}) = \sum_{k' \in M} f_{k'}$, and other derivatives yield moment identities. Note, however, that these desirable identities hold for the $\lambda_{k'}(\hat{\alpha})$ which are obtained for a fixed $k = (k_1, k_2)$, and not for our final estimates in (7), which are the ones we use in the sequel.
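The local fit described above can be sketched in a few lines. The code below is an illustrative reimplementation under stated assumptions, not the authors' program: it handles a two-way table stored as a list of lists, uses the square neighborhood of all cells within `c` steps of `k` in each coordinate, sets off-table cells to zero, and runs a plain Newton-Raphson iteration without step control. The names `solve`, `local_poisson_fit` and `fitted` and the 3x3 toy table are hypothetical. The polynomial degree `t` should be small relative to the window width (with `c = 1`, `t` at most 2), otherwise the design columns become linearly dependent.

```python
import math


def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for j in range(col, n + 1):
                M[r][j] -= factor * M[col][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def local_poisson_fit(table, k, c=1, t=1, iters=50, tol=1e-10):
    """Maximize (6) for model (5) around cell k of a two-way table.

    Returns (alpha_hat, design_rows, counts); lambda_hat_k = exp(alpha_hat[0]).
    """
    k1, k2 = k
    rows, counts = [], []
    for d1 in range(-c, c + 1):
        for d2 in range(-c, c + 1):
            i, j = k1 + d1, k2 + d2
            inside = 0 <= i < len(table) and 0 <= j < len(table[0])
            counts.append(table[i][j] if inside else 0)  # off-table cells count as 0
            x = [1.0]
            for p in range(1, t + 1):
                x += [float(d1 ** p), float(d2 ** p)]
            rows.append(x)
    dim = 2 * t + 1
    # Start from a flat surface at the neighborhood mean (positive when f_k = 1).
    alpha = [math.log(sum(counts) / len(counts))] + [0.0] * (dim - 1)
    for _ in range(iters):
        lam = [math.exp(sum(a * xi for a, xi in zip(alpha, x))) for x in rows]
        grad = [sum((f - l) * x[r] for f, l, x in zip(counts, lam, rows))
                for r in range(dim)]
        info = [[sum(l * x[r] * x[s] for l, x in zip(lam, rows))
                 for s in range(dim)] for r in range(dim)]
        step = solve(info, grad)          # Newton-Raphson step
        alpha = [a + s for a, s in zip(alpha, step)]
        if max(abs(s) for s in step) < tol:
            break
    return alpha, rows, counts


def fitted(alpha, x):
    """Fitted Poisson mean for one design row."""
    return math.exp(sum(a * xi for a, xi in zip(alpha, x)))


# Hypothetical 3x3 sample table; the centre cell (1, 1) is a sample unique.
table = [[3, 4, 6],
         [2, 1, 5],
         [1, 2, 3]]
alpha_hat, rows, counts = local_poisson_fit(table, (1, 1), c=1, t=1)
lambda_hat_k = math.exp(alpha_hat[0])                    # estimate (7)
fitted_total = sum(fitted(alpha_hat, x) for x in rows)   # matches sum(counts) at the optimum
```

At convergence the fitted values over the neighborhood sum to the observed counts, which is the moment identity obtained from the derivative with respect to $\beta_0$. A production implementation would probably add step halving or another safeguard, since the maximum may not exist when the neighborhood is almost empty.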
With the estimate of (7), recalling $\lambda_k = N\gamma_k\pi_k$ and setting $U = \{k : f_k = 1\}$, the set of sample uniques, we now apply the Poisson formulas (3), see also (1), to obtain the risk estimates

$$\hat{\tau}_1 = \sum_{k \in U} e^{-\hat{\lambda}_k(1-\pi_k)/\pi_k}, \qquad \hat{\tau}_2 = \sum_{k \in U} \frac{1}{\hat{\lambda}_k(1-\pi_k)/\pi_k}\left[1 - e^{-\hat{\lambda}_k(1-\pi_k)/\pi_k}\right]. \tag{8}$$

In our experiments we defined neighborhoods $M$ of $k$ by varying around $k$ coordinates corresponding to attributes that are ordinal, and using close values in non-ordinal attributes when possible (e.g., in Occupation). Attributes in which closeness of values cannot be defined remain constant in the whole neighborhood. Thus in our experiments, neighborhoods always consist of individuals of the same Sex. For more details see Section 4.

4. Experiments with neighborhoods

We present a few experiments. They are preliminary, as already mentioned, and more work is needed on the approach itself and on classifying types of data for which it might work.

In the experiments we used our own versions of the Argus and log-linear models methods, programmed on the SAS system. Throughout our experiments two log-linear models are considered, one of independence of all attributes, the other including all two-way interactions.

The weights $w_i$ for the Argus method in all our examples were computed by post-stratification on Sex by Age by Geographical location (the latter is not one of the attributes in any of the tables, but it was used for post-stratification). These variables are commonly used for post-stratification; other strata may give different, and perhaps better, results.

In all experiments we took a real population data file of size $N$ given in the form of a contingency table with $K$ cells, and from it we took a simple random sample of size $n$.
Since the population and the sample are known to us, we can compute the true values of $\tau_1$ and $\tau_2$ and their estimates by the different methods, and compare.

Example 1. In this small example the population consists of a small extract from the 1995 Israeli Census with individuals of age 15 and over, with $N = 15{,}035$ and $K = 448$. From this population we took a random sample of size $n = 1{,}504$, using a fixed sampling fraction, that is $\pi_k = n/N$ for all $k$. The sampling fraction is constant in all our experiments. The attributes (with number of levels in parentheses) were Age Groups (32), and Income Groups (14), both ordinal.

As mentioned above, throughout our experiments two log-linear models are considered, one of independence, the other including all two-way interactions (which in the present case of two attributes is a saturated model). In this experiment we tried our proposed smoothing polynomial approach of (5) for $t = 1, 2$. We considered one type of neighborhood here, constructed by changing each attribute value in $k$ by at most 3 values up or down, that is, the neighborhood of each cell $k$ is

$$M = \{k' : \max_{1 \le i \le m} |k'_i - k_i| \le c\}, \tag{9}$$

with $m = 2$, $c = 3$ and hence size $|M| = 49$. For cells near the boundaries some of the cells in their neighborhoods do not exist; here we set non-existing cells' frequencies to be zero, but other possibilities can be considered. Table 1 presents the true $\tau$ values and their estimates by the methods described above.

Example 2. The population consists of an extract from the 1995 Israeli Census, $N = 37{,}586$, $n = 3{,}759$, and $K = 896$. The attributes are Sex (2) * Age Groups (32) * Income Groups (14). We applied the smoothing polynomial of (5) for $t = 1, 2$ and neighborhoods obtained by varying the attributes of Age and Income as in Example 1 and keeping Sex fixed. In other words we used the neighborhoods

$$M = \{k' : k'_1 = k_1,\ \max_{2 \le i \le m} |k'_i - k_i| \le c\}, \tag{10}$$

with $m = 3$, $c = 3$, which are like (9) on each sub-table of males and females. The results are given in Table 1.

Table 1

| Model | Example 1: $\tau_1$ | Example 1: $\tau_2$ | Example 2: $\tau_1$ | Example 2: $\tau_2$ |
|---|---|---|---|---|
| True Values | 2 | 12.4 | 2 | 19.9 |
| Argus | 7.8 | 19.6 | 14.7 | 37.2 |
| Log Linear Model: Independence | 0.06 | 6.7 | 0.01 | 9.8 |
| Log Linear Model: 2-Way Interactions | 0.01 | 8.6 | 1.4 | 19.6 |
| Smoothing $t = 1$, $\lvert M\rvert = 49$ | 3.2 | 12.0 | 7.0 | 22.5 |
| Smoothing $t = 2$, $\lvert M\rvert = 49$ | 1.7 | 10.4 | 4.8 | 19.0 |

Example 3. Population: an extract from the 1995 Israeli Census. $N = 37{,}586$, $n = 3{,}759$, $K = 11{,}648$. Attributes: Sex (2) * Age Groups (32) * Income Groups (14) * Years of Study (13). We applied the smoothing polynomial of (5) for $t = 2$ and neighborhoods obtained by fixing Sex, so neighborhoods are as in (10), but with $m = 4$, $c = 2$, and since we now vary three variables, each over a range of five values, we have $|M| = 125$. The results are given in Table 2.

Table 2

| Model | $\tau_1$ | $\tau_2$ |
|---|---|---|
| True Values | 187 | 452.0 |
| Argus | 137.2 | 346.4 |
| Log Linear Model: Independence | 217.3 | 518.0 |
| Log Linear Model: 2-Way Interactions | 167.2 | 432.8 |
| Smoothing $t = 2$, $\lvert M\rvert = 125$ | 170.7 | 447.9 |

Table 3

| Model | $\tau_1$ | $\tau_2$ |
|---|---|---|
| True Values | 191 | 568.0 |
| Argus | 79.2 | 315.6 |
| Log Linear Model: Independence | 364.8 | 862.3 |
| Log Linear Model: 2-Way Interactions | 182.2 | 546.2 |
| Smoothing $t = 2$, $\lvert M\rvert = 545$ | 139.6 | 509.1 |
| Smoothing $t = 2$, $\lvert M\rvert = 625$ | 154.7 | 528.5 |
| Smoothing $t = 2$, $\lvert M\rvert = 1025$ | 215.7 | 647.2 |

Table 4

| Model | $\tau_1$ | $\tau_2$ |
|---|---|---|
| True Values | 5 | 36.9 |
| Argus | 7.7 | 35.5 |
| Log Linear Model: Independence | 6.4 | 44.2 |
| Log Linear Model: 2-Way Interactions | 1.1 | 26.4 |
| Smoothing $t = 2$, $\lvert M\rvert = 125$ | 3.3 | 31.3 |

Example 4. Population: an extract from the 2001 UK Census File. $N = 944{,}793$, $n = 18{,}896$, $K = 152{,}100$.
Attributes: Sex (2) * Age Groups (25) * Number of Persons in Household (9) * Education Qualifications (13) * Occupation (26). We applied the smoothing polynomial of (5) for $t = 2$ and neighborhoods defined by fixing Sex and varying all other variables, including Occupation, which was coded as ordinal. The neighborhoods are

$$M = \Big\{k' : k'_1 = k_1,\ \max_{2 \le i \le m} |k'_i - k_i| \le c,\ \sum_i |k'_i - k_i| \le d\Big\}, \tag{11}$$

with $m = 5$, $c = 2$ and $d = 6, 8$, resulting in neighborhood sizes $|M| = 545$ and $625$, respectively. We also tried $c = 3$, $d = 6$ and hence $|M| = 1025$. The results are given in Table 3.

Example 5. Population: an extract from the 1995 Israeli Census. $N = 248{,}983$, $n = 2{,}490$, $K = 8{,}800$. Attributes: Sex (2) * Age Groups (16) * Years of Study (25) * Occupation (11). We applied the smoothing polynomial of (5) for $t = 2$ and neighborhoods obtained by varying three attributes and fixing Sex, so neighborhoods are as in (10) with $m = 4$, $c = 2$, and $|M| = 125$. The results are given in Table 4.

Example 6. Population: an extract from the 1995 Israeli Census. $N = 746{,}949$, $n = 14{,}939$, $K = 337{,}920$. Attributes: Sex (2) * Age Groups (16) * Years of Study (10) * Number of Years in Israel (11) * Income Groups (12) * Number of Persons in Household (8). Note that this is a very sparse table. We applied the smoothing polynomial of (5) for $t = 2$ and neighborhoods obtained by varying all attributes except for Sex, which was fixed. Neighborhoods are as in (11) with $m = 6$, $c = 2$, $d = 4, 6$, and $|M| = 581$ and $1{,}893$, respectively. The results are given in Table 5.

Table 5

| Model | $\tau_1$ | $\tau_2$ |
|---|---|---|
| True Values | 430 | 1,125.8 |
| Argus | 114.5 | 456.0 |
| Log Linear Model: Independence | 773.8 | 1,774.1 |
| Log Linear Model: 2-Way Interactions | 470.0 | 1,178.1 |
| Smoothing $t = 2$, $\lvert M\rvert = 581$ | 287.1 | 988.4 |
| Smoothing $t = 2$, $\lvert M\rvert = 1{,}893$ | 471.1 | 1,240.2 |

Example 7. Population: an extract from the 1995 Israeli Census. $N = 746{,}949$, $n = 7{,}470$, $K = 42{,}240$. Attributes: Sex (2) * Age Groups (16) * Years of Study (10) * Number of Years in Israel (11) * Income Groups (12). We applied the smoothing polynomial of (5) for $t = 2$ and neighborhoods obtained by varying all attributes except for Sex, which was fixed. Neighborhoods are as in (11) with $m = 5$, $c = 2$, $d = 6$, and $|M| = 545$. Smaller neighborhoods did not yield good estimates. The results are given in Table 6.

Table 6

| Model | $\tau_1$ | $\tau_2$ |
|---|---|---|
| True Values | 42 | 171.2 |
| Argus | 20.7 | 95.4 |
| Log Linear Model: Independence | 28.8 | 191.5 |
| Log Linear Model: 2-Way Interactions | 35.8 | 164.1 |
| Smoothing $t = 2$, $\lvert M\rvert = 545$ | 37.1 | 175.1 |

Discussion of examples

The log-linear model method was tested in Skinner and Shlomo [11] and references therein, and it seems to yield good results for experiments of the kind done here. Di Consiglio et al. [5] presented experiments for individual risk assessment with Argus, which seems to perform less well than the log-linear method in many of our experiments with global risk measures.

Our new method still requires fine-tuning. At present the results seem comparable to the log-linear method, and it seems to be computationally somewhat simpler and faster. Naturally, more variables and sparse data sets with a large number of cells are typical and need to be tested. Such files will cause difficulties to any method, and this is where the different methods should be compared. In sparse multi-way tables, model selection will be crucial but difficult for the log-linear method, and perhaps simpler for the smoothing approach.
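The neighborhood sizes $|M|$ quoted in Examples 1 to 7 follow directly from definitions (9), (10) and (11), and a short enumeration reproduces them. The sketch below is an illustration rather than the authors' code; the function name `neighborhood_offsets` is hypothetical, and it enumerates offsets $k' - k$ only over the attributes that vary (Sex, when fixed, contributes a zero offset and is left out).

```python
from itertools import product


def neighborhood_offsets(n_varying, c, d=None):
    """Offsets k' - k over the attributes that vary.

    Keeps offsets with max |offset_i| <= c and, when d is given,
    sum |offset_i| <= d, as in (9), (10) and (11).
    """
    return [off for off in product(range(-c, c + 1), repeat=n_varying)
            if d is None or sum(abs(x) for x in off) <= d]


sizes = [
    len(neighborhood_offsets(2, 3)),       # Examples 1 and 2
    len(neighborhood_offsets(3, 2)),       # Examples 3 and 5
    len(neighborhood_offsets(4, 2, 6)),    # Examples 4 and 7
    len(neighborhood_offsets(4, 2, 8)),    # Example 4
    len(neighborhood_offsets(4, 3, 6)),    # Example 4
    len(neighborhood_offsets(5, 2, 4)),    # Example 6
    len(neighborhood_offsets(5, 2, 6)),    # Example 6
]
print(sizes)  # [49, 125, 545, 625, 1025, 581, 1893]
```

Note that with $c = 2$ and four varying attributes the constraint $d = 8$ is not binding, so the neighborhood is the full cube of $5^4 = 625$ cells.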
We also think that our method may be easier to adapt to complex sampling designs.

Our proposed method is at a preliminary stage and requires more work. Particular directions are the following:

1. Adjust the estimates $\hat{\lambda}_k$ of (7) to fit known population marginals obtained from prior knowledge and sampling weights. In log-linear models the total sum of these estimates corresponds to the sample size, but as commented in Section 3 this is not the case with the smoothing estimates of (7).

2. Use goodness of fit measures and information on population marginals and sampling weights to select the type and size of the neighborhoods, and the degree of the smoothing polynomial in (5). We have observed in experiments that when the sum of all estimates matches the sample size, we obtain good risk measure estimates, and further matching to marginals may improve the estimates.

3. Extend the smoothing approach to the more general Negative Binomial model which subsumes both the Poisson model implemented here, and the Negative Binomial discussed in Section 2.

4. Apply this method also for individual risk measure estimates, which are important in themselves, and may also shed more light on efficient neighborhood and model selection. Our preliminary experiments suggest that the smoothing approach performs relatively well in estimating individual risk.

References

[1] Benedetti, R., Capobianchi, A. and Franconi, L. (1998). Individual risk of disclosure using sampling design information. Contributi Istat 1412003.
[2] Benedetti, R., Franconi, L. and Piersimoni, F. (1999). Per-record risk of disclosure in dependent data. In Proceedings of the Conference on Statistical Data Protection, Lisbon March 1998. European Communities, Luxembourg.
[3] Bethlehem, J., Keller, W. and Pannekoek, J. (1990). Disclosure control of microdata. J. Amer. Statist. Assoc. 85 38–45.
[4] Cameron, A. C. and Trivedi, P. K. (1998). Regression Analysis of Count Data. Cambridge University Press. MR1648274
[5] Di Consiglio, L., Franconi, L. and Seri, G. (2003). Assessing individual risk of disclosure: an experiment. In Proceedings of the Joint ECE/Eurostat Work Session on Statistical Data Confidentiality, Luxemburg 286–298.
[6] Elamir, E. and Skinner, C. (2006). Record-level measures of disclosure risk for survey microdata. J. Official Statist. 22 525–539.
[7] Polettini, S. and Seri, G. (2003). Guidelines for the protection of social micro-data using individual risk methodology: Application within mu-argus version 3.2. CASC Project Deliverable No. 1.2-D3. Available at http://neon.vb.cbs.nl/casc/.
[8] Rinott, Y. (2003). On models for statistical disclosure risk estimation. In Proceedings of the Joint ECE/Eurostat Work Session on Statistical Data Confidentiality, Luxemburg 275–285.
[9] Rinott, Y. and Shlomo, N. (2005). A neighborhood regression model for sample disclosure risk estimation. In Proceedings of the Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality, Geneva, Switzerland.
[10] Simonoff, S. J. (1998). Three sides of smoothing: Categorical data smoothing, nonparametric regression, and density estimation. Internat. Statist. Rev. 66 137–156.
[11] Skinner, C. and Shlomo, N. (2005). Assessing disclosure risk in microdata using record-level measures. In Proceedings of the Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality, Geneva, Switzerland.
[12] Skinner, C. and Holmes, D. (1998). Estimating the re-identification risk per record in microdata. J. Official Statist. 14 361–372.
[13] Willenborg, L. and de Waal, T. (2001). Elements of Statistical Disclosure Control. Springer, New York. MR1866909
[14] Zhang, C.-H. (2005). Estimation of sums of random variables: examples and information bounds. Ann. Statist. 33 2022–2041. MR2211078
