A smoothing model for sample disclosure risk estimation

When a sample frequency table is published, disclosure risk arises when some individuals can be identified on the basis of their values in certain attributes in the table called key variables, and then their values in other attributes may be inferred…

Authors: Yosef Rinott (Hebrew University, Israel); Natalie Shlomo (Hebrew University, Israel & University of Southampton)

IMS Lecture Notes–Monograph Series
Complex Datasets and Inverse Problems: Tomography, Networks and Beyond
Vol. 54 (2007) 161–171
© Institute of Mathematical Statistics, 2007
DOI: 10.1214/074921707000000120

A smoothing model for sample disclosure risk estimation

Yosef Rinott¹,∗ and Natalie Shlomo²,†
Hebrew University and Southampton University

Abstract: When a sample frequency table is published, disclosure risk arises when some individuals can be identified on the basis of their values in certain attributes in the table called key variables, and then their values in other attributes may be inferred, and their privacy is violated. On the basis of the sample to be released, and possibly some partial knowledge of the whole population, an agency which considers releasing the sample has to estimate the disclosure risk.

Risk arises from non-empty sample cells which represent small population cells, and from population uniques in particular. Therefore risk estimation requires assessing how many of the relevant population cells are likely to be small. Various methods have been proposed for this task, and we present a method in which estimation of a population cell frequency is based on smoothing using a local neighborhood of this cell, that is, cells having similar or close values in all attributes.

We provide some preliminary results and experiments with this method. Comparisons are made to two other methods:
1. A log-linear models approach in which inference on a given cell is based on a “neighborhood” of cells determined by the log-linear model. Such neighborhoods have one or some common attributes with the cell in question, but some other attributes may differ significantly.
2. The Argus method, in which inference on a given cell is based only on the sample frequency in the specific cell, on the sample design and on some known marginal distributions of the population, without learning from any type of “neighborhood” of the given cell, nor from any model which uses the structure of the table.

1. Introduction

When a microdata sample file is released by an agency, directly identifying variables, such as name, address, etc., are always deleted, variable values are often grouped (e.g., Age-Groups instead of precise age), and the data is given in the form of a frequency table. However, disclosure risk may still exist, that is, some individuals in the file may be identified by their combination of values in the variables appearing in the data. Samples often contain information on certain variables on which the agency’s information for the whole population is limited, such as expenditure on specific items in a Household Expenditure Survey, or detailed information on variables such as children’s extracurricular activities in the Social Survey of the Israel Central Bureau of Statistics.

∗ Research supported by the Israel Science Foundation (grant No. 473/04).
† Research supported in part by the Israel Science Foundation (grant No. 473/04).
1 Department of Statistics, Hebrew University, Jerusalem, Israel, e-mail: rinott@huji.ac.il
2 Department of Statistics, Hebrew University of Jerusalem, Southampton Statistical Sciences Research Institute, University of Southampton, United Kingdom, e-mail: N.Shlomo@soton.ac.uk
AMS 2000 subject classifications: primary 62H17; secondary 62-07.
Keywords and phrases: sample uniques, neighborhoods, microdata.

Often agencies have to assess the disclosure risk involved in the release of sample data in the form of a frequency table when the corresponding population table may be unknown, or only partially known. Risk arises from cells in which both sample and population frequencies are small, allowing an intruder who has the sample data and access to some information on the population, and in particular on individuals of interest, to identify such individuals in the sample with high probability. Thus, the disclosure risk depends both on the given sample and on the population. In this paper we are concerned with the issue of estimating the disclosure risk involved in releasing a sample on the basis of the sample alone, assuming the population is unknown.

Let $f = \{f_k\}$ denote an $m$-way frequency table, which is a sample from a population table $F = \{F_k\}$, where $k = (k_1, \ldots, k_m)$ indicates a cell and $f_k$ and $F_k$ denote the frequency in the sample and population cell $k$, respectively. Formally, the sample and population sizes in our models are random and their expectations are denoted by $n$ and $N$ respectively, and the number of cells by $K$. We can either assume that $n$ and $N$ are known, or that they are estimated by their natural estimators: the actual sample and population sizes, assumed to be known. In the sequel when we write $n$ or $N$ we formally refer to expectations.

If the $m$ attributes in the table can be considered key variables, that is, variables which are to some extent accessible to the public or to potential intruders, then disclosure risk arises from cells in which both $f_k$ and $F_k$ are positive and small, and in particular when $f_k = F_k = 1$ (sample and population uniques). Suppose an intruder locates a sample unique in cell $k$, say, and is aware of the fact that the combination of values $k = (k_1, \ldots, k_m)$ happens to be unique or rare in the population. If this combination matches an individual of interest to the intruder then identification can be made with high probability on the basis of the $m$ attributes. If the sample contains information on the values of other attributes, then these can now be inferred for the individual in question, and his privacy is violated. In many countries this would constitute a violation of law. For example, the Central Bureau of Statistics in Israel operates under the Statistics Ordinance (1972) which says “No information. . . shall be so [published] as to enable the identification of the person to whom it relates”.

A global risk measure quantifies an aspect of the total risk in the file by aggregating risk over the individual cells. For simplicity we shall focus here only on two global measures, which are based on sample uniques:

$$\tau_1 = \sum_k I(f_k = 1,\, F_k = 1), \qquad \tau_2 = \sum_k I(f_k = 1)\,\frac{1}{F_k},$$

where $I$ denotes the indicator function. Note that $\tau_1$ counts the number of sample uniques which are also population uniques, and $\tau_2$ is the expected number of correct guesses if each sample unique is matched to a randomly chosen individual from the same population cell. These measures are somewhat arbitrary, and one could consider measures which reflect matching of individuals that are not sample uniques, possibly with some restrictions on cell sizes. Also, it may make sense to normalize these measures by some measure of the total size of the table, by the number of sample uniques, or by some measure of the information value of the data. Various individual and global risk measures have been proposed in the literature, see, e.g., Benedetti et al. [1, 2], Skinner and Holmes [12], Elamir and Skinner [6], Rinott [8].
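As a concrete illustration of the two global measures, the following minimal sketch computes $\tau_1$ and $\tau_2$ when both the sample table and the population table are available, as in the experiments reported later. The dictionary layout, the function name `true_risk_measures` and the toy counts are assumptions made for illustration; they are not taken from the paper.

```python
def true_risk_measures(f, F):
    """Return (tau1, tau2) given sample counts f and population counts F.

    Both arguments are dicts keyed by cell; this layout is an assumption
    made for illustration only.
    """
    sample_uniques = [k for k, count in f.items() if count == 1]
    # tau1: number of sample uniques that are also population uniques.
    tau1 = sum(1 for k in sample_uniques if F[k] == 1)
    # tau2: expected number of correct matches when each sample unique is
    # matched to a randomly chosen individual of its population cell.
    tau2 = sum(1.0 / F[k] for k in sample_uniques)
    return tau1, tau2


# Toy two-way table (hypothetical numbers).
f = {(0, 0): 1, (0, 1): 1, (1, 0): 2, (1, 1): 0}
F = {(0, 0): 1, (0, 1): 4, (1, 0): 10, (1, 1): 3}
tau1, tau2 = true_risk_measures(f, F)
print(tau1, tau2)
```

Only the two sample uniques contribute: the first is also a population unique, while the second lies in a population cell of size 4 and adds 1/4 to $\tau_2$.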
In Section 3 we propose and explain a new method of estimation of quantities like $\tau_1$ and $\tau_2$, using a standard Poisson model and local smoothing of frequency tables. The method is based on the idea that one can learn about a given population cell from neighboring cells, if a suitable definition of closeness is possible, without relying on complex modeling. In Sections 2.1 and 2.2 we briefly describe two known methods of estimation of quantities like $\tau_1$ and $\tau_2$, and in Section 4 we provide real data experiments which compare the methods discussed.

We consider the case that $f$ is known, and $F$ is an unknown parameter (on which there may be some partial information) and the quantities $\tau_1$ and $\tau_2$ should be estimated. Note that they are not proper parameters, since they involve both the sample $f$ and the parameter $F$.

The methods discussed in this paper consist of modeling the conditional distribution of $F \mid f$, estimating parameters in this distribution and then using estimates of the form

$$\hat\tau_1 = \sum_k I(f_k = 1)\,\hat P(F_k = 1 \mid f_k = 1), \qquad \hat\tau_2 = \sum_k I(f_k = 1)\,\hat E\Big[\frac{1}{F_k} \,\Big|\, f_k = 1\Big], \tag{1}$$

where $\hat P$ and $\hat E$ denote estimates of the relevant conditional probability and expectation. For a general theory of estimates of this type see Zhang [14] and references therein. Some direct variance estimates appear in Rinott [8].

2. Models

For completeness we briefly introduce the Poisson and Negative Binomial models. More details can be found, for example, in Bethlehem et al. [3], Cameron and Trivedi [4], Rinott [8].

A common assumption in the frequency table literature is $F_k \sim \mathrm{Poisson}(N\gamma_k)$, independently, where $N$ is assumed to be a known parameter, and $\sum_k \gamma_k = 1$.
Binomial (or Poisson) sampling from $F_k$ means that $f_k \mid F_k \sim \mathrm{Bin}(F_k, \pi_k)$, where each $\pi_k$ is a known constant which is part of the sampling design, called the sampling fraction in cell $k$. By standard calculations we then have

$$f_k \sim \mathrm{Poisson}(N\gamma_k\pi_k) \quad\text{and}\quad F_k \mid f_k \sim f_k + \mathrm{Poisson}(N\gamma_k(1-\pi_k)), \tag{2}$$

leading to the Poisson model of Section 2.1 below. Under this model the population size is random with expectation $N$, and so is the sample size, with expectation $N\sum_k \gamma_k\pi_k$ which we denote by $n$. In practice we have in mind that $N$ and $n$ could be estimated by the actual population and sample sizes, and these estimates could be “plugged in” where needed.

If one adds the Bayesian assumption $\gamma_k \sim \mathrm{Gamma}(\alpha, \beta)$ independently, with $\alpha\beta = 1/K$ to ensure that $E\sum_k \gamma_k = 1$, then $f_k \sim NB\big(\alpha,\, p_k = \frac{1}{1 + N\pi_k\beta}\big)$, the Negative Binomial distribution defined for any $\alpha > 0$ by

$$P(f_k = x) = \frac{\Gamma(x+\alpha)}{\Gamma(x+1)\,\Gamma(\alpha)}\,(1-p_k)^x\, p_k^{\alpha}, \qquad x = 0, 1, 2, \ldots,$$

which for a natural $\alpha$ counts the number of failures until $\alpha$ successes occur in independent Bernoulli trials with probability of success $p_k$. Further calculations yield $F_k \mid f_k \sim f_k + NB\big(\alpha + f_k,\, \frac{N\pi_k + 1/\beta}{N + 1/\beta}\big)$, $(F_k \ge f_k)$. Note that in this model the population size is again random with expectation $N$, and now the sample size has expectation $N\sum_k \pi_k / K$ which we denote again by $n$.

As $\alpha \to 0$ (and hence $\beta \to \infty$) we obtain $F_k \mid f_k \sim f_k + NB(f_k, \pi_k)$, which is exactly the Negative Binomial assumption in Section 2.2 below. As $\alpha \to \infty$ the Poisson model of Section 2.1 is obtained, and in this sense the Negative Binomial with parameter $\alpha$ subsumes both models.

Next we discuss two methods which have received much attention. They have been applied in some bureaus of statistics recently, and are being tested by others.

2.1. The Poisson log-linear method

Skinner and Holmes [12] and Elamir and Skinner [6] proposed and studied the following approach. Assuming a fixed sampling fraction, that is, $\pi_k = \pi$, the first part of (2) implies $f_k \sim \mathrm{Poisson}(n\gamma_k)$, where $n = N\pi$. Using the sample $\{f_k\}$ one can fit a log-linear model using standard programs, and obtain estimates $\{\hat\gamma_k\}$ of the parameters. Goodness of fit measures for selecting models having good risk estimates were studied in Skinner and Shlomo [11]. Using the second part of (2) it is easy to compute individual risk measures for cell $k$, defined by

$$P(F_k = 1 \mid f_k = 1) = e^{-N\gamma_k(1-\pi_k)}, \qquad E\Big[\frac{1}{F_k} \,\Big|\, f_k = 1\Big] = \frac{1}{N\gamma_k(1-\pi_k)}\Big[1 - e^{-N\gamma_k(1-\pi_k)}\Big]. \tag{3}$$

Plugging $\hat\gamma_k$ for $\gamma_k$ in (3) leads to the desired estimates $\hat P(F_k = 1 \mid f_k = 1)$ and $\hat E[\frac{1}{F_k} \mid f_k = 1]$ and then to $\hat\tau_1$ and $\hat\tau_2$ of (1).

For each $k$ we therefore obtain estimates of $P(F_k = 1 \mid f_k = 1)$ and $E[\frac{1}{F_k} \mid f_k = 1]$ which depend on $\hat\gamma_k$, which in turn depends on the frequencies in other cells. For example, in a log-linear model of independence, $\hat\gamma_k$ depends on the frequencies in all cells which have a common attribute with $k$. Thus cells that are rather different in nature, having values which are very different from those of cell $k$ in most of the attributes, influence the estimates of the parameter $\gamma_k$ pertaining to this cell. The main goal of this paper is to study the possibility of estimating $\gamma_k$ using cells in more local “neighborhoods,” having attribute values which are closer to those of the cell $k$ in cases where closeness can be defined.

2.2. The Argus method

This method, proposed by Benedetti et al. [1, 2], was originally oriented towards individual risk estimation, but was subsequently also applied to global risk measures, see, e.g., Polettini and Seri [7], and Rinott [8].
Argus has recently been implemented in some European statistical bureaus. In the Argus model it is assumed that $F_k \mid f_k \sim f_k + NB(f_k, \pi_k)$ with an implicit assumption of independence between cells. Since the $\pi_k$ are assumed known we could now calculate $P_{\pi_k}(F_k = 1 \mid f_k = 1)$ and $E_{\pi_k}[\frac{1}{F_k} \mid f_k = 1]$. However, because of non-response, sampling biases and errors, Argus does not use the known $\pi_k$, but rather estimates them from the sampling weights as discussed next.

At statistics bureaus, each statistical unit responding to a sample survey is assigned a sampling weight. This weight $w_i$ is an inflating factor that indicates the number of units in the population that are represented by sample unit $i$, to be used for inference from the sample to the population. It is calculated as the inverse sampling fraction, adjusted for non-response or other biases that may occur in the sampling process. These adjustments are often carried out within post-strata (weighting classes) defined by known marginal distributions of the population, such as Age, Sex and Geographical Location. The inverse sampling fractions are calibrated so that the weighted sample count in each post-stratum is equal to the known population total; this calibration reduces under- or over-representation of the chosen strata due to any bias or sampling errors.

The Argus method provides initial estimates of the population cell sizes of the form $\hat F_k = \sum_{i \in \text{cell } k} w_i$, where $w_i$ denotes the sampling weight of individual $i$ described above (see also the example below).
Here is a simple example. Suppose for simplicity that the sampling weights are based only on the sampling design, and on post-stratification by a single variable, say Sex, and that the sample is designed to be a random subset consisting of one percent of the population, so that we have the same sampling fraction of $\pi = 1/100$ in each Sex group. If males, say, have a non-response rate of 20%, and females of 0%, then the sampling weight for women in the sample would be $w_i = 100$, and for men $w_i = 100/0.8 = 125$. If in the sample table there is a cell $k = (k_1, k_2)$ where $k_1$ stands for Male, and $k_2$ stands for the level in another attribute, such as Income, and $f_k = 20$, then in this cell all $w_i$ are 125, and $\hat F_k = 20 \cdot 125 = 2500$.

Now suppose Sex is not one of the variables in the table to be released, but the agency knows it for all individuals in the sample. Suppose the variables in the table are Income and Occupation, and suppose now $k = (k_1, k_2)$, where $k_1$ stands for a given Income group, and $k_2$ for a given Occupation. Suppose $f_k = 20$, meaning that in the sample there are 20 individuals with the given income group and occupation, and suppose that there are 10 males and 10 females in this group. The weight is $w_i = 100$ for the 10 females, and 125 for the 10 males, and therefore $\hat F_k = 10 \cdot 100 + 10 \cdot 125 = 2250$.

In the above example sampling weights reflect non-response. In principle a bureau may arrive at such weights also because in the original sampling design men are under-represented, or because it finds out that this is the case after post-stratifying on Sex and observing that males are under-represented for some reason (some bias, including non-response, or sampling error).

Returning to Argus, recall its initial estimates of the population cell sizes, $\hat F_k = \sum_{i \in \text{cell } k} w_i$.
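The arithmetic of the weighting example can be retraced with a few lines of code. This is only a sketch under the simplifying assumptions of the example (weights driven by the sampling fraction and a per-stratum response rate); the helper names `sampling_weight` and `argus_cell_size` are hypothetical and do not refer to the Argus software.

```python
def sampling_weight(sampling_fraction, response_rate):
    """Inverse sampling fraction, inflated to compensate for non-response."""
    return 1.0 / (sampling_fraction * response_rate)


def argus_cell_size(weights):
    """Initial estimate of a population cell size: sum of weights in the cell."""
    return sum(weights)


# One-percent sample; 20% non-response among males, none among females.
w_female = sampling_weight(0.01, 1.0)   # about 100
w_male = sampling_weight(0.01, 0.8)     # about 125

# Cell containing 20 sampled males only.
F_hat_all_male = argus_cell_size([w_male] * 20)

# Cell containing 10 sampled females and 10 sampled males.
F_hat_mixed = argus_cell_size([w_female] * 10 + [w_male] * 10)

print(round(F_hat_all_male), round(F_hat_mixed))
```

Up to floating-point rounding, the two cell estimates agree with the values 2500 and 2250 obtained in the text.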
Using the relation $E_{\pi_k}[F_k \mid f_k] = f_k/\pi_k$, the parameters $\pi_k$ are estimated using the moment-type estimate $\hat\pi_k = f_k/\hat F_k$. Note that if $F_k$ were known, this would be the usual estimate of the binomial sampling probability. Straightforward calculations with the Negative Binomial distribution show

$$P_{\hat\pi_k}(F_k = 1 \mid f_k = 1) = \hat\pi_k \quad\text{and}\quad E_{\hat\pi_k}\Big[\frac{1}{F_k} \,\Big|\, f_k = 1\Big] = -\frac{\hat\pi_k}{1-\hat\pi_k}\,\log(\hat\pi_k).$$

Plugging these estimates for $\hat P$ and $\hat E$ in (1) we obtain the estimates $\hat\tau_1$ and $\hat\tau_2$ of the global risk measures.

Note that in this method the cells are treated completely independently, each cell at a time, and the structure of the table, or relations between different cells, play no role. Moreover, since this method does not involve a model which reduces the number of parameters, it is required to estimate essentially $K$ parameters, which is typically hard in sparse tables of the kind we have in mind.

3. Smoothing polynomials and local neighborhoods

The estimation question here is essentially the following: given, say, a sample unique, how likely is it to be also a population unique, or to arise from a small population cell? If a sample unique is found in a part of the sample table where neighboring cells (by some reasonable metric, to be discussed later) are small or empty, then it seems reasonable to believe that it is more likely to have arisen from a small population cell. This motivates our attempt to study local neighborhoods, and to compare the results to the type of model-driven neighborhood used in the log-linear method, and to the Argus method, which uses no neighborhoods.
Consider frequency tables in which some of the attributes are ordinal, and define closeness between categories of an attribute in terms of the order, or more generally, suppose that for a certain attribute one can say that some values of the attribute are closer to a given value than others. For example, Age and Years of Education are ordinal attributes, and naturally the age of 5 is closer to 6 than to 7 or 17, say, while Occupation is not ordinal, but one can try to define reasonable notions of closeness between different occupations.

Classical log-linear models do not take such closeness into account, and therefore, when such models are used for individual cell parameter estimation, the estimates involve data in cells which may be rather remote from the estimated cell. On the other hand, as mentioned above, the Argus method bases its estimation only on the sampling weight of the estimated population cell. There is no learning from other cells, the structure of the table plays no role, and each cell’s parameter is estimated separately.

We now describe our proposed approach, which consists of using local neighborhoods of the estimated cell.

Returning to (2) we assume that $f_k \sim \mathrm{Poisson}(\lambda_k = N\gamma_k\pi_k)$. Apart from constants, the sample log-likelihood is $\sum_{k=1}^{K}[f_k \log\lambda_k - \lambda_k]$. However, if we use a model for $\lambda_k$ which is valid only in some neighborhood $M$ of a given cell, we shall consider the log-likelihood of the data in this neighborhood, that is

$$\sum_{k \in M}[f_k \log\lambda_k - \lambda_k]. \tag{4}$$

For convenience of notation we now assume that $m = 2$, that is, we consider two-way tables; the extension to any $m$ is straightforward. Following Simonoff [10], see also references therein, we use a local smoothing polynomial model. For each fixed $k = (k_1, k_2)$ separately, we write the model below for $\lambda_{k'}$ in terms of the parameters $\alpha = (\beta_0, \beta_1, \gamma_1, \ldots, \beta_t, \gamma_t)$, with $k' = (k'_1, k'_2)$ varying in some neighborhood of $k$:

$$\log\lambda_{k'}(\alpha) \equiv \log\lambda_{(k'_1, k'_2)} = \beta_0 + \beta_1(k'_1 - k_1) + \gamma_1(k'_2 - k_2) + \cdots + \beta_t(k'_1 - k_1)^t + \gamma_t(k'_2 - k_2)^t, \tag{5}$$

for some natural number $t$. One can hope that such a polynomial model is valid with a suitable $t$ for $k' = (k'_1, k'_2)$ in some neighborhood $M$ of $k = (k_1, k_2)$. Substituting (5) into (4) we maximize the concave function

$$L(\alpha) = L(\beta_0, \beta_1, \gamma_1, \ldots, \beta_t, \gamma_t) = \sum_{(k'_1, k'_2) \in M}\big[f_{(k'_1, k'_2)} \log\lambda_{(k'_1, k'_2)} - \lambda_{(k'_1, k'_2)}\big] \tag{6}$$

with respect to the coefficients in $\alpha$ of the regression model (5). With $\arg\max L(\alpha) = \hat\alpha$, and $\hat\beta_0$ denoting its first component, we finally obtain our estimate of $\lambda_k = \lambda_{(k_1, k_2)}$ in the form

$$\hat\lambda_k \equiv \lambda_k(\hat\alpha) = \exp(\hat\beta_0), \tag{7}$$

where the second equality is explained by taking $k' = k = (k_1, k_2)$ in (5). The maximization by the Newton-Raphson method is rather straightforward and fast.

Each of the estimates $\hat\lambda_k$ requires a separate maximization as above, which leads to a value $\hat\alpha$ that depends on $k = (k_1, k_2)$, and a set of estimates $\lambda_{k'}(\hat\alpha)$, of which only $\hat\lambda_k$ of (7) is used. For the risk measures discussed in this paper, it suffices to compute these estimates for cells $k$ which are sample uniques, that is, $f_k = 1$.

Equating the partial derivative of the function of (6) with respect to $\beta_0$ to zero we obtain $\sum_{k' \in M}\lambda_{k'}(\hat\alpha) = \sum_{k' \in M} f_{k'}$, and other derivatives yield moment identities. Note, however, that these desirable identities hold for $\lambda_{k'}(\hat\alpha)$ which are obtained for a fixed $k = (k_1, k_2)$, and not for our final estimates in (7), which are the ones we use in the sequel.
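To make the estimation steps (4)–(7) concrete, here is a minimal, dependency-free sketch of the local fit for a two-way table. It is not the authors' implementation (their experiments were programmed in SAS). The function names, the dictionary layout of `f`, the starting value and the stopping rule are assumptions, and the square neighborhood with half-width `c` is only one possible choice of $M$. The sketch treats missing cells as zero counts and assumes $t \le 2c$ so that the polynomial terms are not collinear.

```python
import math
from itertools import product


def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for j in range(col, n + 1):
                M[r][j] -= factor * M[col][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def local_lambda(f, k, c=3, t=2, iters=50, tol=1e-10):
    """Estimate lambda_k = exp(beta_0) by maximizing the local likelihood (6).

    f maps 2-tuples (cell indices) to sample counts; cells absent from f are
    treated as having frequency zero.
    """
    k1, k2 = k
    rows, y = [], []
    for d1, d2 in product(range(-c, c + 1), repeat=2):
        x = [1.0]                       # coefficient order: beta_0, beta_1, gamma_1, ...
        for p in range(1, t + 1):
            x += [float(d1 ** p), float(d2 ** p)]
        rows.append(x)
        y.append(float(f.get((k1 + d1, k2 + d2), 0)))
    if sum(y) == 0:
        return 0.0                      # empty neighborhood: the likelihood pushes lambda to 0
    q = len(rows[0])
    beta = [math.log(sum(y) / len(y))] + [0.0] * (q - 1)   # start from a flat fit
    for _ in range(iters):
        lam = [math.exp(sum(bj * xj for bj, xj in zip(beta, x))) for x in rows]
        # Score vector and negative Hessian of the Poisson log-likelihood.
        grad = [sum((yi - li) * x[j] for x, yi, li in zip(rows, y, lam))
                for j in range(q)]
        info = [[sum(li * x[r] * x[s] for x, li in zip(rows, lam))
                 for s in range(q)] for r in range(q)]
        step = solve(info, grad)        # Newton-Raphson update
        beta = [bj + sj for bj, sj in zip(beta, step)]
        if max(abs(sj) for sj in step) < tol:
            break
    return math.exp(beta[0])
```

In line with the remark above, such a function would be called once per sample unique, and only the fitted value at the center of each neighborhood is kept.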
With the estimate of (7), recalling $\lambda_k = N\gamma_k\pi_k$ and setting $U = \{k : f_k = 1\}$, the set of sample uniques, we now apply the Poisson formulas (3), see also (1), to obtain the risk estimates

$$\hat\tau_1 = \sum_{k \in U} e^{-\hat\lambda_k(1-\pi_k)/\pi_k}, \qquad \hat\tau_2 = \sum_{k \in U} \frac{1}{\hat\lambda_k(1-\pi_k)/\pi_k}\Big[1 - e^{-\hat\lambda_k(1-\pi_k)/\pi_k}\Big]. \tag{8}$$

In our experiments we defined neighborhoods $M$ of $k$ by varying around $k$ coordinates corresponding to attributes that are ordinal, and using close values in non-ordinal attributes when possible (e.g., in Occupation). Attributes in which closeness of values cannot be defined remain constant in the whole neighborhood. Thus in our experiments, neighborhoods always consist of individuals of the same Sex. For more details see Section 4.

4. Experiments with neighborhoods

We present a few experiments. They are preliminary, as already mentioned, and more work is needed on the approach itself and on classifying types of data for which it might work.

In the experiments we used our own versions of the Argus and log-linear models methods, programmed on the SAS system. Throughout our experiments two log-linear models are considered, one of independence of all attributes, the other including all two-way interactions.

The weights $w_i$ for the Argus method in all our examples were computed by post-stratification on Sex by Age by Geographical location (the latter is not one of the attributes in any of the tables, but it was used for post-stratification). These variables are commonly used for post-stratification; other strata may give different, and perhaps better, results.

In all experiments we took a real population data file of size $N$ given in the form of a contingency table with $K$ cells, and from it we took a simple random sample of size $n$.
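Before turning to the individual examples, the smoothing estimates (8) that enter the comparisons can be sketched as follows. The function name and its arguments are assumptions made for illustration; the sketch takes a common sampling fraction $\pi$ (as in the experiments) and a list of fitted values $\hat\lambda_k$ for the sample uniques, however these were obtained.

```python
import math


def poisson_risk_estimates(lam_hat, pi):
    """Return (tau1_hat, tau2_hat) of (8) for a common sampling fraction pi.

    lam_hat: iterable with one fitted value per sample unique.
    """
    tau1 = 0.0
    tau2 = 0.0
    for lam in lam_hat:
        mu = lam * (1.0 - pi) / pi          # estimated mean of F_k - f_k
        tau1 += math.exp(-mu)               # P(F_k = 1 | f_k = 1)
        # E[1/F_k | f_k = 1] = (1 - exp(-mu)) / mu, with limit 1 as mu -> 0.
        tau2 += -math.expm1(-mu) / mu if mu > 0 else 1.0
    return tau1, tau2


# Two hypothetical sample uniques with fitted values log(2) and 0, pi = 1/2.
tau1_hat, tau2_hat = poisson_risk_estimates([math.log(2.0), 0.0], pi=0.5)
print(tau1_hat, tau2_hat)
```

A fitted value of zero corresponds to a cell judged certain to be a population unique, so it contributes 1 to both estimates; larger fitted values contribute progressively less.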
Since the population and the sample are known to us, we can compute the true values of $\tau_1$ and $\tau_2$ and their estimates by the different methods, and compare.

Example 1. In this small example the population consists of a small extract from the 1995 Israeli Census with individuals of age 15 and over, with $N = 15{,}035$ and $K = 448$. From this population we took a random sample of size $n = 1{,}504$, using a fixed sampling fraction, that is $\pi_k = n/N$ for all $k$. The sampling fraction is constant in all our experiments. The attributes (with number of levels in parentheses) were Age Groups (32), and Income Groups (14), both ordinal.

As mentioned above, throughout our experiments two log-linear models are considered, one of independence, the other including all two-way interactions (which, in the present case of two attributes, is a saturated model). In this experiment we tried our proposed smoothing polynomial approach of (5) for $t = 1, 2$. We considered one type of neighborhood here, constructed by changing each attribute value in $k$ by at most 3 values up or down, that is, the neighborhood of each cell $k$ is

$$M = \{k' : \max_{1 \le i \le m} |k'_i - k_i| \le c\}, \tag{9}$$

with $m = 2$, $c = 3$ and hence size $|M| = 49$. For cells near the boundaries some of the cells in their neighborhoods do not exist; here we set non-existing cells’ frequencies to be zero, but other possibilities can be considered.

Table 1 presents the true $\tau$ values and their estimates by the methods described above.

Table 1
                                          Example 1            Example 2
Model                                     τ1        τ2         τ1        τ2
True Values                               2         12.4       2         19.9
Argus                                     7.8       19.6       14.7      37.2
Log Linear Model: Independence            0.06      6.7        0.01      9.8
Log Linear Model: 2-Way Interactions      0.01      8.6        1.4       19.6
Smoothing t = 1, |M| = 49                 3.2       12.0       7.0       22.5
Smoothing t = 2, |M| = 49                 1.7       10.4       4.8       19.0

Example 2.
The population consists of an extract from the 1995 Israeli Census, $N = 37{,}586$, $n = 3{,}759$, and $K = 896$. The attributes are Sex (2) * Age Groups (32) * Income Groups (14). We applied the smoothing polynomial of (5) for $t = 1, 2$ and neighborhoods obtained by varying the attributes of Age and Income as in Example 1 and keeping Sex fixed. In other words we used the neighborhoods

$$M = \{k' : k'_1 = k_1,\ \max_{2 \le i \le m} |k'_i - k_i| \le c\}, \tag{10}$$

with $m = 3$, $c = 3$, which are like (9) on each sub-table of males and females. The results are given in Table 1.

Example 3. Population: an extract from the 1995 Israeli Census. $N = 37{,}586$, $n = 3{,}759$, $K = 11{,}648$. Attributes: Sex (2) * Age Groups (32) * Income Groups (14) * Years of Study (13).

We applied the smoothing polynomial of (5) for $t = 2$ and neighborhoods obtained by fixing Sex, so neighborhoods are as in (10), but with $m = 4$, $c = 2$, and since we now vary three variables, each over a range of five values, we have $|M| = 125$. The results are given in Table 2.

Table 2
Model                                     τ1        τ2
True Values                               187       452.0
Argus                                     137.2     346.4
Log Linear Model: Independence            217.3     518.0
Log Linear Model: 2-Way Interactions      167.2     432.8
Smoothing t = 2, |M| = 125                170.7     447.9

Table 3
Model                                     τ1        τ2
True Values                               191       568.0
Argus                                     79.2      315.6
Log Linear Model: Independence            364.8     862.3
Log Linear Model: 2-Way Interactions      182.2     546.2
Smoothing t = 2, |M| = 545                139.6     509.1
Smoothing t = 2, |M| = 625                154.7     528.5
Smoothing t = 2, |M| = 1025               215.7     647.2

Table 4
Model                                     τ1        τ2
True Values                               5         36.9
Argus                                     7.7       35.5
Log Linear Model: Independence            6.4       44.2
Log Linear Model: 2-Way Interactions      1.1       26.4
Smoothing t = 2, |M| = 125                3.3       31.3

Example 4. Population: an extract from the 2001 UK Census File. $N = 944{,}793$, $n = 18{,}896$, $K = 152{,}100$.
Attributes: Sex (2) * Age Groups (25) * Number of Persons in Household (9) * Education Qualifications (13) * Occupation (26).

We applied the smoothing polynomial of (5) for $t = 2$ and neighborhoods defined by fixing Sex and varying all other variables, including Occupation, which was coded as ordinal. The neighborhoods are

$$M = \Big\{k' : k'_1 = k_1,\ \max_{2 \le i \le m} |k'_i - k_i| \le c,\ \sum_i |k'_i - k_i| \le d\Big\}, \tag{11}$$

with $m = 5$, $c = 2$ and $d = 6, 8$, resulting in neighborhood sizes $|M| = 545$ and $625$, respectively. We also tried $c = 3$, $d = 6$ and hence $|M| = 1025$. The results are given in Table 3.

Example 5. Population: an extract from the 1995 Israeli Census. $N = 248{,}983$, $n = 2{,}490$, $K = 8{,}800$. Attributes: Sex (2) * Age Groups (16) * Years of Study (25) * Occupation (11).

We applied the smoothing polynomial of (5) for $t = 2$ and neighborhoods obtained by varying three attributes and fixing Sex, so neighborhoods are as in (10) with $m = 4$, $c = 2$, and $|M| = 125$. The results are given in Table 4.

Example 6. Population: an extract from the 1995 Israeli Census. $N = 746{,}949$, $n = 14{,}939$, $K = 337{,}920$. Attributes: Sex (2) * Age Groups (16) * Years of Study (10) * Number of Years in Israel (11) * Income Groups (12) * Number of Persons in Household (8). Note that this is a very sparse table.

We applied the smoothing polynomial of (5) for $t = 2$ and neighborhoods obtained by varying all attributes except for Sex, which was fixed. Neighborhoods are as in (11) with $m = 6$, $c = 2$, $d = 4, 6$, and $|M| = 581$ and $1{,}893$, respectively. The results are given in Table 5.

Table 5
Model                                     τ1        τ2
True Values                               430       1,125.8
Argus                                     114.5     456.0
Log Linear Model: Independence            773.8     1,774.1
Log Linear Model: 2-Way Interactions      470.0     1,178.1
Smoothing t = 2, |M| = 581                287.1     988.4
Smoothing t = 2, |M| = 1,893              471.1     1,240.2

Table 6
Model                                     τ1        τ2
True Values                               42        171.2
Argus                                     20.7      95.4
Log Linear Model: Independence            28.8      191.5
Log Linear Model: 2-Way Interactions      35.8      164.1
Smoothing t = 2, |M| = 545                37.1      175.1

Example 7. Population: an extract from the 1995 Israeli Census. $N = 746{,}949$, $n = 7{,}470$, $K = 42{,}240$. Attributes: Sex (2) * Age Groups (16) * Years of Study (10) * Number of Years in Israel (11) * Income Groups (12).

We applied the smoothing polynomial of (5) for $t = 2$ and neighborhoods obtained by varying all attributes except for Sex, which was fixed. Neighborhoods are as in (11) with $m = 5$, $c = 2$, $d = 6$, and $|M| = 545$. Smaller neighborhoods did not yield good estimates. The results are given in Table 6.

Discussion of examples. The log-linear model method was tested in Skinner and Shlomo [11] and references therein, and it seems to yield good results for experiments of the kind done here. Di Consiglio et al. [5] presented experiments for individual risk assessment with Argus, which seems to perform less well than the log-linear method in many of our experiments with global risk measures. Our new method still requires fine-tuning. At present the results seem comparable to the log-linear method, and it seems to be computationally somewhat simpler and faster.

Naturally, more variables and sparse data sets with a large number of cells are typical and need to be tested. Such files will cause difficulties to any method, and this is where the different methods should be compared.
In sparse multi-way tables, model selection will be crucial but difficult for the log-linear method, and perhaps simpler for the smoothing approach. We also think that our method may be easier to modify to complex sampling designs.

Our proposed method is at a preliminary stage and requires more work. Particular directions are the following:

1. Adjust the estimates $\hat{\gamma}_k$ of (7) to fit known population marginals obtained from prior knowledge and sampling weights. In log-linear models the total sum of these estimates corresponds to the sample size, but as commented in Section 3 this is not the case with the smoothing estimates of (7).
2. Use goodness of fit measures and information on population marginals and sampling weights to select the type and size of the neighborhoods, and the degree of the smoothing polynomial in (5). We have observed in experiments that when the sum of all estimates matches the sample size, we obtain good risk measure estimates, and further matching to marginals may improve the estimates.
3. Extend the smoothing approach to the more general Negative Binomial model, which subsumes both the Poisson model implemented here and the Negative Binomial discussed in Section 2.
4. Apply this method also for individual risk measure estimates, which are important in themselves, and may also shed more light on efficient neighborhood and model selection. Our preliminary experiments suggest that the smoothing approach performs relatively well in estimating individual risk.

References

[1] Benedetti, R., Capobianchi, A. and Franconi, L. (1998). Individual risk of disclosure using sampling design information. Contributi Istat 141 2003.
[2] Benedetti, R., Franconi, L. and Piersimoni, F. (1999). Per-record risk of disclosure in dependent data. In Proceedings of the Conference on Statistical Data Protection, Lisbon March 1998. European Communities, Luxembourg.
[3] Bethlehem, J., Keller, W. and Pannekoek, J. (1990). Disclosure control of microdata. J. Amer. Statist. Assoc. 85 38–45.
[4] Cameron, A. C. and Trivedi, P. K. (1998). Regression Analysis of Count Data. Cambridge University Press. MR1648274
[5] Di Consiglio, L., Franconi, L. and Seri, G. (2003). Assessing individual risk of disclosure: an experiment. In Proceedings of the Joint ECE/Eurostat Work Session on Statistical Data Confidentiality, Luxemburg 286–298.
[6] Elamir, E. and Skinner, C. (2006). Record-level measures of disclosure risk for survey microdata. J. Official Statist. 22 525–539.
[7] Polettini, S. and Seri, G. (2003). Guidelines for the protection of social micro-data using individual risk methodology - Application within mu-argus version 3.2. CASC Project Deliverable No. 1.2-D3. Available at http://neon.vb.cbs.nl/casc/.
[8] Rinott, Y. (2003). On models for statistical disclosure risk estimation. In Proceedings of the Joint ECE/Eurostat Work Session on Statistical Data Confidentiality, Luxemburg 275–285.
[9] Rinott, Y. and Shlomo, N. (2005). A neighborhood regression model for sample disclosure risk estimation. In Proceedings of the Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality Geneva, Switzerland.
[10] Simonoff, S. J. (1998). Three sides of smoothing: Categorical data smoothing, nonparametric regression, and density estimation. Internat. Statist. Rev. 66 137–156.
[11] Skinner, C. and Shlomo, N. (2005). Assessing disclosure risk in microdata using record-level measures. In Proceedings of the Joint UNECE/Eurostat Work Session on Statistical Data Confidentiality Geneva, Switzerland.
[12] Skinner, C. and Holmes, D. (1998). Estimating the re-identification risk per record in microdata, J. Official Statist. 14 361–372.
[13] Willenborg, L. and de Waal, T. (2001). Elements of Statistical Disclosure Control. Springer, New York. MR1866909
[14] Zhang, C.-H. (2005). Estimation of sums of random variables: examples and information bounds. Ann. Statist. 33 2022–2041. MR2211078
