Fundamental Limits on Data Acquisition: Trade-offs between Sample Complexity and Query Difficulty

Hye Won Chung*, Ji Oon Lee, and Alfred O. Hero

Abstract

We consider query-based data acquisition and the corresponding information recovery problem, where the goal is to recover $k$ binary variables (information bits) from parity measurements of those variables. The queries and the corresponding parity measurements are designed using the encoding rule of Fountain codes. By using Fountain codes, we can design a potentially limitless number of queries and corresponding parity measurements, and guarantee that the original $k$ information bits can be recovered with high probability from any sufficiently large set of measurements of size $n$. In the query design, the average number of information bits associated with one parity measurement is called the query difficulty ($\bar{d}$), and the minimum number of measurements required to recover the $k$ information bits for a fixed $\bar{d}$ is called the sample complexity ($n$). We analyze the fundamental trade-offs between the query difficulty and the sample complexity, and show that a sample complexity of $n = c \max\{k, (k \log k)/\bar{d}\}$ for some constant $c > 0$ is necessary and sufficient to recover $k$ information bits with high probability as $k \to \infty$.

Index Terms

Sample complexity, query difficulty, Fountain codes, Soliton distribution, crowdsourcing.

I. INTRODUCTION

Query-based data acquisition arises in diverse applications including crowdsourcing [1], [2]; active learning [3], [4]; experimental design [5], [6]; and community recovery or clustering in graphs [7], [8]. In these applications, query-based data acquisition can be modeled as a 20 questions problem [9] between an oracle (or oracles) and a player, where the oracle knows the values of the information bits that the player aims to recover, while the player designs queries to the oracle and receives answers from the oracle.
In this paper, we consider query-based data acquisition with the goal of recovering the values of $k$ variables $(x_1, \dots, x_k)$ (information bits). When the variables and the answers (measurements) are binary, we can consider a parity check sum as a type of measurement, which corresponds to the exclusive or (XOR, or modulo-2 sum) of some subset of the information bits. Querying parity symbols of the information bits generalizes the 20 questions model of [9], [10]. This generalization is the focus of this paper and has a wide range of applications, in particular to crowdsourcing systems [1].

Consider a crowdsourcing system consisting of a number of workers and a particular task that they are expected to work on. Assume that the task is to classify a collection of $k$ images into two exclusive groups, e.g., whether or not an image is suitable for children. A worker (oracle) in the system is given a query about some subset of the images and asked to provide a binary answer regarding those images. Assume that the worker can skip the query if the worker is unsure of the answer. The probability that a worker skips a query is unknown at the stage of query design, and it can be different for each worker in the system, depending on their abilities or efforts. Therefore, from the query designer's point of view, it is natural to assume that only a random subset of the designed queries will be answered by the workers in the crowdsourcing system. The query designer's objective is to design queries such that when the number of received answers exceeds some threshold, regardless of which subset of the answers was collected, the original $k$ binary bits can be recovered with high probability.

In this paper we show that Fountain codes [11] are naturally suited to this crowdsourcing query design problem. Fountain codes are a type of forward error correcting code suitable for binary erasure channels (BEC) with unknown erasure probabilities.
This type of code has been the subject of much research on reliable internet packet transmission, where the packets transmitted from the source are randomly lost before they arrive at the destination.

* Hye Won Chung (hwchung@kaist.ac.kr) is with the School of Electrical Engineering at KAIST in South Korea. Ji Oon Lee (jioon.lee@kaist.edu) is with the Department of Mathematical Sciences at KAIST in South Korea. Alfred O. Hero (hero@eecs.umich.edu) is with the Department of EECS at the University of Michigan. This work was partially supported by National Research Foundation of Korea under grant number 2017R1E1A1A01076340, and by United States Army Research Office under grant W911NF-15-1-0479.

For $k$ input symbols $(x_1, x_2, \dots, x_k)$, Fountain codes produce a potentially limitless number of parity measurements, which are also called output symbols. By using well-designed Fountain codes, one can guarantee that, given any set of output symbols of size $k(1 + \delta)$ with small overhead $\delta > 0$, the input symbols can be recovered with high probability. Examples of Fountain codes are LT codes [12] and Raptor codes [13]. By using these Fountain codes, we can design a potentially limitless number of queries (and corresponding parity measurements) with the desired properties suitable for the crowdsourcing example.

However, the Fountain code framework must be extended in order to account for the worker's limited capacity to answer difficult queries. The query difficulty is defined as the average number of input symbols required to compute a single parity measurement. The query difficulty is related to the encoding complexity for one-stage encoding, but it is different from the encoding complexity when the encoding is done in multiple stages. The query difficulty represents the number of input symbols that the worker must know, on average, to calculate one parity measurement.
Depending on the query difficulty, the number of answers (parity measurements) required to recover $k$ input symbols may vary greatly. We call the minimum number of measurements required to recover $k$ input symbols the sample complexity. The sample complexity is a function of the query difficulty as well as the number $k$ of input symbols to be recovered.

Let us consider two extreme cases. First, consider the case when the query difficulty is equal to 1. More specifically, we assume that each query asks the value of only one variable in $(x_1, x_2, \dots, x_k)$ at a time. Since it is not known at the stage of query design which of the queries will be answered by the workers, a set of queries is designed by picking one variable in $(x_1, x_2, \dots, x_k)$ uniformly at random at a time. For such querying scenarios, with $n$ randomly selected measurements, in order to recover the $k$ information bits with error probability less than $1/k^u$ for some constant $u > 0$, the required number $n$ of queries scales as $k \log k$. On the other hand, when each query is designed to generate a parity measurement of $k/2$ randomly selected bits at a time, the required number $n$ of measurements is only $k + c \log k$ for some constant $c > 0$ [13]. Therefore, in these two extreme cases, we observe that when the query difficulty is equal to 1 the sample complexity scales as $k \log k$, whereas for a query difficulty of order $k$ the sample complexity scales as $k$. The question is then how the sample complexity scales as the query difficulty increases from 1 to $\Theta(k)$.

In this paper, we aim to analyze the fundamental trade-offs between the sample complexity and the query difficulty in recovering $k$ information bits. There have been papers that have analyzed such trade-offs when it is assumed that the parity measurements involve only a fixed number $1 \le d \le k$ of input symbols.
In [14], the case of pairwise measurements ($d = 2$) was considered, and in [15], a general integer $1 \le d \le k$ was considered. Note that there are a total of $\binom{k}{d}$ possible parity measurements for a fixed $d$. In both papers, it was assumed that each measurement is independently observed with probability $p_{\mathrm{obs}}$. It was shown that the number $n$ of measurements needed to recover $k$ input symbols with high probability scales as $n = p_{\mathrm{obs}} \binom{k}{d} = c \max\{k, (k \log k)/d\}$ for some constant $c > 0$.

In this paper, we generalize the work in [15] by not fixing the number $d$ but instead allowing $d$ to follow a distribution $(\Omega_0, \Omega_1, \dots, \Omega_k)$, where $\Omega_d$ denotes the probability that the value $d$ is chosen and $\sum_{d=0}^{k} \Omega_d = 1$. We consider the average query difficulty $\bar{d} = \sum_{d=0}^{k} d\,\Omega_d$ and analyze the sample complexity $n$ as a function of the query difficulty $\bar{d}$ and the number $k$ of input symbols. By assuming that $d$ follows the prescribed distribution, we can generate potentially limitless parity measurements by using the encoding rule employed by Fountain codes, guaranteeing that for any set of a fixed number of measurements it is possible to recover the $k$ input symbols with high probability. This framework is thus more suitable for situations where the parity measurements are erased with arbitrary (unknown) probabilities, so that the ability to generate a potentially limitless number of queries and corresponding parity measurements is required. For the previous frameworks in [14], [15], the maximum number of parity measurements is restricted to $\binom{k}{d}$ for a fixed $d$.

Our main contribution in this paper is to specify the fundamental trade-offs between the sample complexity ($n$) and the query difficulty ($\bar{d}$) in this generalized measurement model. We show that the sample complexity $n$ necessary and sufficient to recover $k$ input symbols with high probability scales as
$$n = c \cdot \max\left\{k, \frac{k \log k}{\bar{d}}\right\} \tag{1}$$
for some constant $c > 0$.
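
To make the scaling in (1) concrete, the following minimal Python sketch evaluates $c \max\{k, (k \log k)/\bar{d}\}$ for a few query difficulties. The function name `sample_complexity_scale` and the default $c = 1$ are illustrative assumptions: the analysis only asserts that some constant $c > 0$ exists and does not specify its value.

```python
import math


def sample_complexity_scale(k: int, d_bar: float, c: float = 1.0) -> float:
    """Evaluate c * max(k, k*log(k)/d_bar), the scaling law in (1).

    c is a placeholder constant (assumption); log is the natural logarithm,
    matching the paper's convention.
    """
    if k < 2 or d_bar <= 0:
        raise ValueError("need k >= 2 and d_bar > 0")
    return c * max(k, k * math.log(k) / d_bar)


if __name__ == "__main__":
    k = 10_000
    # From constant difficulty up to difficulty beyond log k.
    for d_bar in (1.0, 2.0, math.log(k), 2.0 * math.log(k)):
        n = sample_complexity_scale(k, d_bar)
        print(f"d_bar = {d_bar:6.2f}  ->  n scales like {n:10.1f}")
```

Once $\bar{d}$ reaches $\log k$, the maximum is attained by the first argument, so the value stops decreasing and stays at $c\,k$.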
Note that for $\bar{d} = O(\log k)$, the sample complexity $n$ is inversely proportional to the query difficulty $\bar{d}$. In particular, when the query difficulty is $\bar{d} = \Theta(1)$, the sample complexity scales as $k \log k$, whereas when $\bar{d} = \Theta(\log k)$, the sample complexity scales as $k$.

The rest of this paper is organized as follows. In Section II, we explain the encoding rule of Fountain codes (how to generate a potentially limitless number of parity measurements) and state the main problem of this paper. In Section III, we provide the main results, showing the fundamental trade-offs between the sample complexity and the query difficulty. In Section IV, we prove the main theorem. More technical details of the proof are presented in the Appendices. In Section V, some simulation results are provided, which further support our theoretical results. In Section VI, we provide conclusions and discuss possible future research directions.

A. Notations

We use the notation $\oplus$ for the XOR of binary variables, i.e., for $a, b \in \{0, 1\}$, $a \oplus b = 0$ iff $a = b$ and $a \oplus b = 1$ iff $a \neq b$. We denote by $e_j$ the $k$-dimensional unit vector with its $j$-th element equal to 1. For a vector $x$, $\|x\|_1$ denotes the number of 1's in the vector $x$. For vectors $x$ and $y$, the inner product between $x$ and $y$ is denoted by $x \cdot y$. For two integers $\alpha$ and $\beta$, we use the notation $\alpha \equiv \beta$ to indicate that $\mathrm{mod}(\alpha, 2) = \mathrm{mod}(\beta, 2)$. For two vectors $x = (x_1, x_2, \dots, x_k)$ and $y = (y_1, y_2, \dots, y_k)$, when we write $x \equiv y$, it means that $\mathrm{mod}(x_i, 2) = \mathrm{mod}(y_i, 2)$ for all $i \in \{1, 2, \dots, k\}$. We use the $O(\cdot)$ and $\Theta(\cdot)$ notations to describe the asymptotics of real sequences $\{a_k\}$ and $\{b_k\}$: $a_k = O(b_k)$ implies that $a_k \le M b_k$ for some positive real number $M$ for all $k \ge k_0$; $a_k = \Theta(b_k)$ implies that $a_k \le M b_k$ and $a_k \ge M' b_k$ for some positive real numbers $M$ and $M'$ for all $k \ge k_0'$. The logarithmic function $\log$ is with base $e$.

II. MODEL AND PROBLEM STATEMENT

Consider a $k$-dimensional binary random vector $x = (X_1, X_2, \dots, X_k)^T$, which is uniformly distributed over $\{0, 1\}^k$. We call $X_1, X_2, \dots, X_k$ the input symbols. We aim to learn the values of $(X_1, X_2, \dots, X_k)$ by observing a total of $n$ parity measurements of different subsets of those $k$ bits. Consider $k$-dimensional binary vectors $v_i = (v_{i1}, v_{i2}, \dots, v_{ik})$, $i = 1, \dots, n$. The parity measurement associated with the vector $v_i$ is defined by
$$Y_i = \mathrm{mod}\left(\sum_{j=1}^{k} v_{ij} X_j,\ 2\right) = v_{i1} X_1 \oplus \cdots \oplus v_{ik} X_k \tag{2}$$
for $i = 1, \dots, n$. We call such parity measurements $(Y_1, Y_2, \dots, Y_n)$ the output symbols. Each $v_i \in \{0, 1\}^k$ determines which subset of $(X_1, X_2, \dots, X_k)$ is picked in calculating the $i$-th parity measurement. The process of designing $\{v_i\}$ is called query design or encoding.

We use Fountain codes, also known as rateless erasure codes, for the encoding. Let $(\Omega_0, \Omega_1, \dots, \Omega_k)$ be a distribution on $\{0, 1, \dots, k\}$, where $\Omega_d$ denotes the probability that the value $d$ is chosen and $\sum_{d=0}^{k} \Omega_d = 1$. In the encoding of Fountain codes, each vector $v_i$ is generated independently and randomly by first sampling a weight $d \in \{0, 1, \dots, k\}$ from the distribution $(\Omega_0, \dots, \Omega_k)$ and then selecting a $k$-dimensional vector of weight $d$ uniformly at random from all the vectors in $\{0, 1\}^k$ with weight $d$.

Consider an arbitrary set of $n$ output symbols $(Y_1, \dots, Y_n)$ generated by the above encoding rule. The relationship between the $k$ input symbols and the $n$ output symbols can be depicted by a bipartite graph with $k$ input nodes on one side and $n$ output nodes on the other side, as shown in Fig. 1. Denote by $\bar{d}$ the average degree of the output nodes,
$$\bar{d} = \sum_{d=1}^{k} d \cdot \Omega_d. \tag{3}$$
This number $\bar{d}$ indicates the average number of input symbols involved in one parity measurement (output symbol) and is related to the difficulty of calculating one parity measurement. We call this number the query difficulty.

The process of recovering the $k$ input symbols from the $n$ output symbols is called information recovery or decoding. Denote by $\hat{x}(Y_1^n)$ the estimate of $x$ given $Y_1^n$, and define the probability of error as
$$P_e^{(k)} = \min_{\hat{x}(\cdot)} \Pr\left(\hat{x}(Y_1^n) \neq x\right). \tag{4}$$
With a proper choice of the distribution $(\Omega_0, \dots, \Omega_k)$, Fountain codes guarantee that $P_e^{(k)} \to 0$ as $k \to \infty$ when $n$ is larger than some threshold. The minimum $n$ required to guarantee $P_e^{(k)} \to 0$ as $k \to \infty$, minimized over all $(\Omega_0, \dots, \Omega_k)$ for a fixed $k$ and $\bar{d}$, is called the sample complexity. We aim to find the fundamental limits on $n$ that guarantee reliable information recovery of $k$ input symbols for a fixed query difficulty $\bar{d}$.

Fig. 1. Bipartite graph between input nodes ($x_1, \dots, x_k$) and output nodes ($y_1, \dots, y_n$).

III. MAIN RESULTS: FUNDAMENTAL TRADE-OFFS BETWEEN SAMPLE COMPLEXITY AND QUERY DIFFICULTY

In this section, we state our main results: the sample complexity $n$ that is necessary and sufficient to make $P_e^{(k)} \to 0$ as $k \to \infty$ scales in terms of $k$ and $\bar{d}$ as $n = c \cdot \max\{k, (k \log k)/\bar{d}\}$ for some constant $c > 0$ independent of $k$ and $\bar{d}$, when the parity measurements are generated by the encoding rule of Fountain codes as explained in Section II.

We first state the well-known lower bound on the sample complexity $n$ of Fountain codes presented in [13].

Proposition 1: To reliably recover $k$ input symbols with $P_e^{(k)} \le 1/k^u$ for some constant $u > 0$ from parity measurements generated by Fountain codes, it is necessary that
$$n \ge c_l \max\left\{k, \frac{k \log k}{\bar{d}}\right\} \tag{5}$$
for some constant $c_l > 0$.
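
Before turning to the proof, the encoding rule of Section II can be illustrated with a minimal Python sketch: a weight $d$ is drawn from $(\Omega_0, \dots, \Omega_k)$, a support of $d$ indices is chosen uniformly at random, and the parity in (2) is the XOR of the selected bits. The helper names (`sample_query`, `parity`, `encode`) and the representation of $v_i$ by its support are assumptions made for illustration, not part of the original description.

```python
import random
from typing import List, Sequence, Tuple


def sample_query(k: int, omega: Sequence[float], rng: random.Random) -> List[int]:
    """Draw one query vector v_i, returned as the sorted list of its support.

    omega[d] is the probability of weight d, for d = 0..k.
    """
    d = rng.choices(range(k + 1), weights=omega, k=1)[0]
    return sorted(rng.sample(range(k), d))


def parity(x: Sequence[int], support: Sequence[int]) -> int:
    """Parity measurement (2): XOR of the bits of x indexed by the support."""
    y = 0
    for j in support:
        y ^= x[j]
    return y


def encode(x: Sequence[int], omega: Sequence[float], n: int,
           rng: random.Random) -> List[Tuple[List[int], int]]:
    """Generate n (query, answer) pairs; more can be generated at any time."""
    k = len(x)
    out = []
    for _ in range(n):
        support = sample_query(k, omega, rng)
        out.append((support, parity(x, support)))
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    x = [1, 0, 1, 1]
    omega = [0.0, 0.5, 0.5, 0.0, 0.0]  # illustrative weights on d = 1, 2
    for support, y in encode(x, omega, 5, rng):
        print(support, y)
```

Because each pair is generated independently, the designer can keep calling `encode` until enough answers have been collected, which is the rateless property the query design relies on.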
Proof: Showing the first condition $n \ge c_l k$ is straightforward. Each output symbol $Y_i = \mathrm{mod}\left(\sum_{j=1}^{k} v_{ij} X_j, 2\right)$ represents a linear equation in the $k$ unknown input symbols $(X_1, X_2, \dots, X_k)$. Since there are $k$ unknowns, it is necessary to have at least $n = k$ linear equations to solve this linear system reliably.

The second condition $n \ge (c_l k \log k)/\bar{d}$ follows from a property of random graphs. In the bipartite graph between input nodes and output nodes, we say that an input node is isolated if it is not connected to any of the output nodes. We analyze the probability that an input node is isolated when the edges are designed by the encoding rule of Fountain codes. The error probability $P_e^{(k)}$ is bounded below by the probability that an input node is isolated, since a decoding error occurs whenever an input node is isolated. Consider an output node with degree $d$. The probability that an input node is not connected to this output node of degree $d$ equals $1 - d/k$. Since an output node has degree $d$ with probability $\Omega_d$, the probability that an input node is not connected to an output node equals
$$\sum_{d=0}^{k} \Omega_d (1 - d/k) = 1 - \bar{d}/k. \tag{6}$$
Since there are $n$ output nodes and these output nodes are sampled independently, the probability that an input node is isolated (not connected to any of those output nodes) equals
$$\left(1 - \frac{\bar{d}}{k}\right)^{n}. \tag{7}$$
By the mean value theorem, we can show that $(1 - \bar{d}/k)^n \ge e^{-\alpha/(1 - \alpha/n)}$, where $\alpha = n\bar{d}/k$. Since the decoding error probability $P_e^{(k)}$ is lower bounded by the probability that an input node is isolated, to satisfy $P_e^{(k)} \le 1/k^u$ for some constant $u > 0$ it is necessary that $e^{-\alpha/(1 - \alpha/n)} \le 1/k^u$, which is equivalent to
$$\alpha \ge \frac{u \log k}{1 + (u \log k)/n} \ge \frac{u \log k}{1 + (u \log k)/k} \ge \frac{u \log k}{1 + (u \log 3)/3} \ge c_l \log k \tag{8}$$
for some constant $c_l > 0$, where the second inequality uses $n \ge k$ and the third uses that $(\log k)/k$ is decreasing for $k \ge 3$. By plugging in $\alpha = n\bar{d}/k$, we get
$$n\bar{d} \ge c_l\, k \log k. \tag{9}$$

The main contribution of this paper is showing that the bound in (5) is indeed achievable (up to constant scaling) by properly designed Fountain codes for any $\bar{d}$ from $\Theta(1)$ to $\Theta(\log k)$. We provide a particular output degree distribution $(\Omega_0, \Omega_1, \dots, \Omega_k)$ for which we can control the query difficulty $\bar{d}$ from $\Theta(1)$ to $\Theta(\log k)$, and show that it is possible to reliably recover $k$ information bits with sample complexity obeying
$$n = c_u \max\left\{k, \frac{k \log k}{\bar{d}}\right\} \tag{10}$$
for some constant $c_u > 0$. Therefore, by combining (10) with (5), we conclude that $n = c \cdot \max\{k, (k \log k)/\bar{d}\}$ is necessary and sufficient for reliable recovery of $k$ information bits when $\bar{d}$ is the average query difficulty.

Suppose that the law of $\Omega_d$ is given by an ideal Soliton distribution
$$\Omega_d = \begin{cases} \dfrac{1}{D} & \text{if } d = 1, \\[4pt] \dfrac{1}{d(d-1)} & \text{if } 2 \le d \le D, \\[4pt] 0 & \text{if } d > D \text{ or } d = 0, \end{cases} \tag{11}$$
for some $D \in \{2, 3, \dots, k\}$. Here, for simplicity, we assume that $k \ge 3$. Note that the query difficulty scales as $\log D$, since
$$\log(D + 1) < \bar{d} = \frac{1}{D} + \sum_{d=2}^{D} \frac{1}{d - 1} = \sum_{d=1}^{D} \frac{1}{d} < \log D + 1.$$
Therefore, as $D$ increases from 2 to $k$, the query difficulty $\bar{d}$ scales from about $\log 3$ to about $\log k$.

Theorem 1: For the Soliton distribution (11) with $D \in \{2, 3, \dots, k\}$, the $k$ input symbols can be reliably recovered, i.e., $P_e^{(k)} \to 0$ as $k \to \infty$, with sample complexity
$$n = c_u \cdot \max\left\{k, \frac{k \log k}{\bar{d}}\right\} \tag{12}$$
for some constant $c_u > 0$.

The proof of Theorem 1 will be presented in Section IV. Theorem 1 states that for query difficulty $\bar{d} = O(\log k)$, the sample complexity $n$ needed to reliably recover $k$ input symbols is inversely proportional to the query difficulty $\bar{d}$. When the query difficulty does not increase in $k$, i.e., $\bar{d} = \Theta(1)$, it is necessary and sufficient to have $n = \Theta(k \log k)$ to reliably recover the $k$ information bits. In this regime, the ratio between $k$ and $n$ converges to 0 as $k \to \infty$.
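
As a numerical aside, the Soliton distribution (11) and the bounds $\log(D+1) < \bar{d} < \log D + 1$ can be checked with a short Python sketch. The function names `soliton` and `query_difficulty` are illustrative assumptions; the values of $k$ and $D$ in the demo are arbitrary.

```python
import math
from typing import List, Sequence


def soliton(k: int, D: int) -> List[float]:
    """Return (Omega_0, ..., Omega_k) for the distribution in (11)."""
    if not 2 <= D <= k:
        raise ValueError("need 2 <= D <= k")
    omega = [0.0] * (k + 1)
    omega[1] = 1.0 / D
    for d in range(2, D + 1):
        omega[d] = 1.0 / (d * (d - 1))
    return omega


def query_difficulty(omega: Sequence[float]) -> float:
    """Average degree d_bar = sum_d d * Omega_d, as in (3)."""
    return sum(d * w for d, w in enumerate(omega))


if __name__ == "__main__":
    k = 300
    for D in (2, 10, 100, k):
        omega = soliton(k, D)
        d_bar = query_difficulty(omega)
        # d_bar equals the D-th harmonic number and sits between the two logs.
        print(D, round(sum(omega), 12), round(d_bar, 4),
              math.log(D + 1) < d_bar < math.log(D) + 1)
```

The single parameter $D$ is what tunes the query difficulty: small $D$ keeps queries easy, while $D = k$ pushes $\bar{d}$ up to order $\log k$.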
When the query difficulty is increased to $\bar{d} = \Theta(\log k)$, by contrast, it is enough to have $n = \Theta(k)$ samples, which results in a positive limit of $k/n$ as $k \to \infty$. When $\bar{d} \gg \log k$, increasing the query difficulty no longer helps in reducing the sample complexity.

By using the Soliton distribution (11) and the encoding rule of Fountain codes, we can design a potentially limitless number of queries about $(x_1, x_2, \dots, x_k)$ and the corresponding parity measurements. Theorem 1 shows that with any set of measurements of size $n$ at least as large as (12), we can reliably recover the $k$ information bits as $k \to \infty$. Moreover, this sample size is optimal up to constants, as shown by Proposition 1. Thus, our results provide the optimal query design strategy for reliable information recovery from an arbitrary set of parity measurements, optimal in terms of the sample complexity (up to constants) for a fixed query difficulty.

IV. PROOF OF THEOREM 1

In this section, we prove Theorem 1 by providing an upper bound on $P_e^{(k)}$ and showing that the sample complexity $n$ sufficient to make this upper bound converge to 0 as $k \to \infty$ is equal to $c_u \cdot \max\{k, (k \log k)/\bar{d}\}$ for some constant $c_u > 0$.

Consider $P_e^{(k)}$ defined in (4). The optimal decoding rule $\hat{x}(\cdot)$ that minimizes the probability of error is maximum likelihood (ML) decoding for the uniformly distributed input symbols. Assume that we collect $n$ parity measurements $(Y_1, \dots, Y_n)$, each of which equals $Y_i = \mathrm{mod}\left(\sum_{j=1}^{k} v_{ij} X_j, 2\right)$. Consider a matrix $A$ whose $i$-th row is $v_i = (v_{i1}, v_{i2}, \dots, v_{ik})$, i.e.,
$$A := [v_1; v_2; \dots; v_n]. \tag{13}$$
We call $A$ a sampling matrix. Given $(Y_1, \dots, Y_n)$ and the sampling matrix $A$, the ML decoding rule finds $x = (X_1, X_2, \dots, X_k)^T \in \{0, 1\}^k$ such that
$$A x \equiv (Y_1, Y_2, \dots, Y_n)^T. \tag{14}$$
If there is a unique solution $x \in \{0, 1\}^k$ for this linear system, then the decoder outputs $\hat{x}(Y_1^n) = x$. If there is more than one $x$ satisfying this linear system, then an error is declared. The probability of error is thus equal to
$$P_e^{(k)} = \sum_{x \in \{0,1\}^k} \frac{1}{2^k} \Pr\left(\exists\, x' \neq x \text{ such that } A x' \equiv A x\right). \tag{15}$$
Due to symmetry, the probabilities $\Pr(\exists\, x' \neq x \text{ such that } A x' \equiv A x)$ are equal for every $x \in \{0, 1\}^k$. Thus, we focus on the case where $x$ is the vector of all zeros and consider
$$P_e^{(k)} = \Pr\left(\exists\, x' \neq 0 \text{ such that } A x' \equiv 0\right). \tag{16}$$
By using the union bound, it can be shown that
$$P_e^{(k)} \le \sum_{x' \neq 0} \Pr(A x' \equiv 0) = \sum_{s=1}^{k} \sum_{\|x'\|_1 = s} \Pr(A x' \equiv 0) = \sum_{s=1}^{k} \binom{k}{s} \Pr\left(A \left(\sum_{i=1}^{s} e_i\right) \equiv 0\right), \tag{17}$$
where $e_i$ is the $i$-th standard unit vector. The last equality follows from the symmetry of the sampling matrix $A$. Since all the output samples are generated independently by the identically distributed $v_i$'s, each of which has weight $d$ with probability $\Omega_d$,
$$P_e^{(k)} \le \sum_{s=1}^{k} \binom{k}{s} \left(\Pr\left(v_1 \cdot \left(\sum_{i=1}^{s} e_i\right) \equiv 0\right)\right)^{n} = \sum_{s=1}^{k} \binom{k}{s} \left(\sum_{d=1}^{k} \Omega_d \Pr\left(v_1 \cdot \left(\sum_{i=1}^{s} e_i\right) \equiv 0 \,\middle|\, \|v_1\|_1 = d\right)\right)^{n}. \tag{18}$$
We next analyze
$$\Pr\left(v_1 \cdot \left(\sum_{i=1}^{s} e_i\right) \equiv 0 \,\middle|\, \|v_1\|_1 = d\right). \tag{19}$$
Note that $v_1 \cdot \left(\sum_{i=1}^{s} e_i\right) \equiv 0$ if and only if there is an even number of 1's in the first $s$ entries of $v_1$. This probability equals
$$\Pr\left(v_1 \cdot \left(\sum_{i=1}^{s} e_i\right) \equiv 0 \,\middle|\, \|v_1\|_1 = d\right) = \frac{\sum_{i \le d,\ i \text{ even}} \binom{s}{i} \binom{k-s}{d-i}}{\binom{k}{d}}. \tag{20}$$
We next provide an upper bound on (20). Define
$$I_d = \sum_{i \le d,\ i \text{ even}} \binom{s}{i} \binom{k-s}{d-i}. \tag{21}$$
In the following lemma, we provide an upper bound on $I_d$ as a multiple of $\binom{k}{d}$. The proof of this lemma is based on that of the similar lemma provided in [15], where the upper bound on $I_d$ is stated depending on the regimes of $s$ for a fixed $d$. We provide an alternative version where the upper bound on $I_d$ depends on the regimes of $d$ for a fixed $s$.
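
For small $k$, the quantities in (18), (20) and (21) can be evaluated exactly, which gives a numerical sanity check on the union bound. The following Python sketch is an illustration under stated assumptions: the function names are hypothetical, the sum over $d$ starts at $d = 0$ (which coincides with (18) whenever $\Omega_0 = 0$, as for the Soliton distribution), and floating-point arithmetic limits it to moderate $k$.

```python
from math import comb
from typing import Sequence


def even_overlap_prob(k: int, s: int, d: int) -> float:
    """Probability in (20): a uniformly random weight-d vector in {0,1}^k has
    an even number of ones among a fixed set of s coordinates (I_d / C(k,d))."""
    i_d = sum(comb(s, i) * comb(k - s, d - i)
              for i in range(0, min(s, d) + 1, 2))
    return i_d / comb(k, d)


def union_bound(k: int, n: int, omega: Sequence[float]) -> float:
    """Right-hand side of (18) for a degree distribution omega[0..k]."""
    total = 0.0
    for s in range(1, k + 1):
        p_even = sum(omega[d] * even_overlap_prob(k, s, d)
                     for d in range(k + 1) if omega[d] > 0.0)
        total += comb(k, s) * p_even ** n
    return total


if __name__ == "__main__":
    k = 8
    # Soliton distribution (11) with D = 4, written out by hand.
    omega = [0.0, 1 / 4, 1 / 2, 1 / 6, 1 / 12, 0.0, 0.0, 0.0, 0.0]
    for n in (10, 20, 40, 80):
        print(n, union_bound(k, n, omega))
```

The bound can exceed 1 for small $n$, in which case it carries no information; it becomes meaningful once $n$ is large enough for every term $\binom{k}{s} p^n$ to be small.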
Lemma 1: Consider the case that $s \le \frac{k}{2}$ (i.e., $s \le k - s$). Define
$$\kappa(s) = \frac{k - s + 1}{2s + 1}. \tag{22}$$
1) For $d \le \frac{k}{2}$ (or, $k - d \ge d$), when we define $\alpha = \frac{k - d + 1}{d}$,
$$I_d \le \begin{cases} \left(1 - \dfrac{2s}{5\alpha}\right) \dbinom{k}{d}, & \text{when } d < \kappa(s), \\[6pt] \dfrac{4}{5} \dbinom{k}{d}, & \text{when } d \ge \kappa(s). \end{cases} \tag{23}$$
2) For $d > \frac{k}{2}$ (or, $k - d < d$), when we define $\alpha' = \frac{d + 1}{k - d}$,
$$I_d \le \begin{cases} \left(1 - \dfrac{2s}{5\alpha'}\right) \dbinom{k}{d}, & \text{when } d > k - \kappa(s), \\[6pt] \dfrac{4}{5} \dbinom{k}{d}, & \text{when } d \le k - \kappa(s). \end{cases} \tag{24}$$
In the case $s > \frac{k}{2}$, we can obtain the bounds for $I_d$ simply by changing $s$ to $k - s$.

Proof: Appendix A.

By using Lemma 1 and (20), the upper bound on $P_e^{(k)}$ in (18) can be further bounded by
$$P_e^{(k)} \le 2 \sum_{s \le \frac{k}{2}} \binom{k}{s} \left(\sum_{d=1}^{\lceil \kappa(s) \rceil - 1} \left(1 - \frac{2s}{5\alpha}\right) \Omega_d + \sum_{d=\lceil \kappa(s) \rceil}^{k - \lceil \kappa(s) \rceil} \frac{4}{5}\, \Omega_d + \sum_{d=k - \lceil \kappa(s) \rceil + 1}^{k} \left(1 - \frac{2s}{5\alpha'}\right) \Omega_d\right)^{n} = 2 \sum_{s \le \frac{k}{2}} \binom{k}{s} (1 - \Sigma_s)^n \le 2 \sum_{s \le \frac{k}{2}} \binom{k}{s} e^{-n\Sigma_s}, \tag{25}$$
where we let
$$\Sigma_s = \frac{1}{5} \sum_{d=\lceil \kappa(s) \rceil}^{k - \lceil \kappa(s) \rceil} \Omega_d + \frac{2s}{5} \left(\sum_{d=1}^{\lceil \kappa(s) \rceil - 1} \frac{d\,\Omega_d}{k - d + 1} + \sum_{d=k - \lceil \kappa(s) \rceil + 1}^{k} \frac{(k - d)\,\Omega_d}{d + 1}\right). \tag{26}$$
Suppose that the law of $\Omega_d$ is given by the Soliton distribution provided in (11). Here, for simplicity, we assume that $D \in \{2, 3, \dots, k\}$ and $k \ge 3$. For this Soliton distribution, we provide an upper bound on $\binom{k}{s} e^{-n\Sigma_s}$ in (25) for $s \le \frac{k}{2}$, depending on the regime of $\lceil \kappa(s) \rceil$, with conditions on the sample complexity $n$.

Lemma 2: With the sample complexity
$$n \ge c_u \max\left\{k, \frac{k \log k}{\bar{d}}\right\} \tag{27}$$
for some constant $c_u > 0$, the term $\binom{k}{s} e^{-n\Sigma_s}$ is bounded above as follows.
1) If $\lceil \kappa(s) \rceil > D$,
$$\binom{k}{s} e^{-n\Sigma_s} < k^{-s}. \tag{28}$$
2) If $4 \le \lceil \kappa(s) \rceil \le D$,
$$\binom{k}{s} e^{-n\Sigma_s} \le \begin{cases} k^{-s} & \text{if } s \le \sqrt{k}, \\ 2^{-2\sqrt{k}} & \text{if } \sqrt{k} < s \le k/2. \end{cases} \tag{29}$$
3) If $\lceil \kappa(s) \rceil \le 3$,
$$\binom{k}{s} e^{-n\Sigma_s} \le 2^k e^{-k}. \tag{30}$$

Proof: Appendix B.

We remark that case 2) does not happen when $D \in \{2, 3\}$. From Lemma 2, when the sample complexity $n$ satisfies (27), we can further bound $P_e^{(k)}$ in (25) by
$$P_e^{(k)} \le 2 \sum_{s \le \frac{k}{2}} \left(k^{-s} + 2^{-2\sqrt{k}} + 2^k e^{-k}\right) \le c_0 \left(\frac{1}{k} + k\, 2^{-2\sqrt{k}} + k\, 2^k e^{-k}\right) \tag{31}$$
for some constant $c_0 > 0$. Note that this upper bound converges to 0 as $k \to \infty$.

V. SIMULATIONS

In this section, we provide an empirical performance analysis of the probability of error in the recovery of $k$ information bits, as a function of the sample complexity and the query difficulty.

In Fig. 2, we provide Monte Carlo simulation results for the probability of error $P_e^{(k)}$, defined in (4), where the number $k$ of information bits to recover is fixed as $k = 300$. We plot $P_e^{(k)}$ in terms of the normalized sample complexity, normalized by $(k \log k)/\bar{d}$, where $\bar{d}$ is the query difficulty. We run the simulations for three different query difficulties, $\bar{d} = 4$, $4.7$, $5.6$. The parity measurements (output symbols) are designed by first sampling $d$ (the number of input symbols required to compute a single parity measurement) from the Soliton distribution (11) and then generating the measurements by the encoding rule of Fountain codes.

Fig. 2. Monte Carlo simulation (5000 runs) of the probability of error $P_e^{(k)}$ with $k = 300$ for three different $\bar{d}$'s (the query difficulties). The sample complexity is normalized by $(k \log k)/\bar{d}$. We can observe the phase transition for $P_e^{(k)}$ around the normalized sample complexity equal to 1 for all three query difficulties considered.

Fig. 3. Same simulation conditions as in Fig. 2, except that the horizontal axis is the un-normalized sample complexity. As the query difficulty increases, the sample complexity needed to make $P_e^{(k)}$ close to 0 decreases. This illustrates the trade-offs between the query difficulty and the sample complexity.

Observe the phase transition of $P_e^{(k)}$ around the normalized sample complexity equal to 1. In Theorem 1, we stated that with a sample complexity of $c_u \cdot \max\{k, (k \log k)/\bar{d}\}$ for some constant $c_u > 0$, we can guarantee $P_e^{(k)} \to 0$ as $k \to \infty$.
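
The simulation setup of this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used to produce the figures: it treats a decoding error as the event that the sampling matrix $A$ has rank less than $k$ over GF(2) (the system (14) then has more than one solution), represents rows as integer bitmasks, and uses small illustrative values of $k$, $D$ and the number of runs. All function names are hypothetical.

```python
import math
import random
from typing import List


def soliton(k: int, D: int) -> List[float]:
    """Degree distribution (11): Omega_1 = 1/D, Omega_d = 1/(d(d-1)), 2 <= d <= D."""
    omega = [0.0] * (k + 1)
    omega[1] = 1.0 / D
    for d in range(2, D + 1):
        omega[d] = 1.0 / (d * (d - 1))
    return omega


def gf2_rank(rows: List[int]) -> int:
    """Rank over GF(2) of a matrix whose rows are given as integer bitmasks."""
    basis = {}  # highest set bit -> reduced row with that leading bit
    for r in rows:
        while r:
            h = r.bit_length() - 1
            if h not in basis:
                basis[h] = r
                break
            r ^= basis[h]  # cancel the leading bit and keep reducing
    return len(basis)


def random_row(k: int, omega: List[float], rng: random.Random) -> int:
    """One query vector v_i as a bitmask, drawn by the Fountain encoding rule."""
    d = rng.choices(range(k + 1), weights=omega, k=1)[0]
    mask = 0
    for j in rng.sample(range(k), d):
        mask |= 1 << j
    return mask


def error_rate(k: int, D: int, n: int, runs: int, seed: int = 0) -> float:
    """Fraction of runs in which n random measurements fail to determine x."""
    rng = random.Random(seed)
    omega = soliton(k, D)
    failures = 0
    for _ in range(runs):
        rows = [random_row(k, omega, rng) for _ in range(n)]
        if gf2_rank(rows) < k:
            failures += 1
    return failures / runs


if __name__ == "__main__":
    k, D, runs = 60, 10, 200  # illustrative values, smaller than in Fig. 2
    d_bar = sum(1.0 / d for d in range(1, D + 1))
    unit = k * math.log(k) / d_bar  # normalization used on the horizontal axis
    for ratio in (0.5, 0.75, 1.0, 1.5, 2.0):
        n = int(ratio * unit)
        print(f"normalized n = {ratio:4.2f}  error rate = {error_rate(k, D, n, runs):.3f}")
```

Checking the rank is enough here because the input is uniform and the measurements are noiseless, so ML decoding succeeds exactly when the linear system has a unique solution.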
The simulation results show that $c_u \approx 1$ is sufficient to produce a dramatic decrease of $P_e^{(k)}$. Since the phase transition occurs in the vicinity of normalized sample complexity equal to 1, the figure demonstrates the trade-offs between the query difficulty and the sample complexity. Specifically, the number of parity measurements required to reliably recover $k$ information bits is inversely proportional to the query difficulty when $\bar{d} = O(\log k)$. Note that for the Soliton distribution (11), the query difficulty is $O(\log k)$, and thus $\max\{k, (k \log k)/\bar{d}\} = \Theta((k \log k)/\bar{d})$.

In Fig. 3, we show the same simulation with the un-normalized sample complexity indexing the horizontal axis. From this plot, we can observe that as the query difficulty increases, the number of samples required to make $P_e^{(k)}$ close to 0 decreases.

VI. CONCLUSIONS

In this paper, we analyzed the fundamental trade-offs between query difficulty $\bar{d}$ and sample complexity $n$ in a query-based data acquisition system associated with a crowdsourcing task with workers who may be non-responsive to certain queries (channel erasures). We considered the information recovery of $k$ binary variables $(x_1, x_2, \dots, x_k)$ from parity measurements of subsets of these variables. We used a query design based on the encoding rules of Fountain codes, with which we can design a potentially limitless number of queries. We showed that the proposed query design policy guarantees that the original $k$ information bits can be recovered with high probability from any set of measurements of size $n$ larger than some threshold. We obtained necessary and sufficient conditions on the sample complexity, $n \ge c \cdot \max\{k, (k \log k)/\bar{d}\}$.

There are several interesting future research directions related to this work. One such direction is analyzing trade-offs between query difficulty and sample complexity for partial information recovery problems.
In this paper, we considered exact information recovery, meaning that we aimed to recover all $k$ information bits with high probability. Depending on the scenario, however, it could be enough to recover only $\alpha k$ of the information bits for $\alpha \in (0, 1)$. The question is then how much this relaxed recovery condition would help in reducing the sample complexity for a given query difficulty $\bar{d}$. In particular, one interesting question is whether it is possible to recover $\alpha k$ information bits with only $n = \Theta(k)$ measurements even with the very low query difficulty $\bar{d} = \Theta(1)$, which does not increase in $k$. In the exact recovery problem, it was impossible to reliably recover $k$ input symbols with sample complexity $n = \Theta(k)$ when the query difficulty is $\bar{d} = \Theta(1)$. With $\bar{d} = \Theta(1)$, it was necessary to have at least $n = \Theta(k \log k)$ samples for exact recovery, which makes the ratio $k/n$ go to 0 as $k \to \infty$. Therefore, it would be interesting to see whether a sample complexity of $n = \Theta(k)$ is sufficient for the partial recovery problem even with a query difficulty of $\bar{d} = \Theta(1)$.

Another interesting direction is to apply the proposed query design to real crowdsourcing systems and to analyze the experimental trade-offs between the query difficulty and the sample complexity. In particular, when the collected measurements contain inaccurate answers and the probability that the measurements include inaccurate answers changes depending on the query difficulty, the corresponding sample complexity might be a different function of $k$ and $\bar{d}$. Therefore, it would be interesting to find the query difficulty that minimizes the sample complexity in crowdsourcing systems with random erasures and inaccurate answers; this direction of research would help guide the design of sample-efficient crowdsourcing systems.

APPENDIX A
PROOF OF LEMMA 1

To prove this lemma, we refer to the similar bound provided in [15].
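Lemma 3: Let $\beta = \left\lceil \max\left\{\frac{k - d + 1}{2d + 1}, \frac{d + 1}{2(k - d) + 1}\right\} \right\rceil$ and $\alpha = \max\left\{\frac{k - d + 1}{d}, \frac{d + 1}{k - d}\right\}$. Then we have
$$\sum_{i \le d,\ i \text{ odd}} \binom{s}{i} \binom{k-s}{d-i} \ge \begin{cases} \dfrac{2s}{5\alpha} \dbinom{k}{d}, & \text{when } s < \beta, \\[6pt] \dfrac{1}{5} \dbinom{k}{d}, & \text{when } \beta \le s \le k - \beta, \\[6pt] \dfrac{2(k - s)}{5\alpha} \dbinom{k}{d}, & \text{when } k - \beta < s. \end{cases} \tag{32}$$
Note that
$$\binom{k}{d} = \sum_{i \le d,\ i \text{ odd}} \binom{s}{i} \binom{k-s}{d-i} + \sum_{i \le d,\ i \text{ even}} \binom{s}{i} \binom{k-s}{d-i}. \tag{33}$$
Therefore, by using Lemma 3, we can find an upper bound on $\sum_{i \le d,\ i \text{ even}} \binom{s}{i} \binom{k-s}{d-i}$ as a scaling of $\binom{k}{d}$ such that
$$I_d = \sum_{i \le d,\ i \text{ even}} \binom{s}{i} \binom{k-s}{d-i} \le \begin{cases} \left(1 - \dfrac{2s}{5\alpha}\right) \dbinom{k}{d}, & \text{when } s < \beta, \\[6pt] \dfrac{4}{5} \dbinom{k}{d}, & \text{when } \beta \le s \le k - \beta, \\[6pt] \left(1 - \dfrac{2(k - s)}{5\alpha}\right) \dbinom{k}{d}, & \text{when } k - \beta < s. \end{cases} \tag{34}$$
We define
$$\kappa(s) = \frac{k - s + 1}{2s + 1}. \tag{35}$$
We first consider the case $s \le \frac{k}{2}$ (i.e., $s \le k - s$). Since $\beta$ attains its maximum $\left\lceil \frac{k}{3} \right\rceil$ at $d = 1$ or $d = k - 1$, we find that
$$\beta < \frac{k}{2} \le k - s.$$
Hence, $k - \beta > s$ and the last case in (34) cannot happen.

1) For $d \le \frac{k}{2}$ (or, $k - d \ge d$),
$$\beta = \left\lceil \frac{k - d + 1}{2d + 1} \right\rceil, \qquad \alpha = \frac{k - d + 1}{d}.$$
Note that $\frac{k - \kappa(s) + 1}{2\kappa(s) + 1} = s$. Since $\frac{k - d + 1}{2d + 1}$ is a decreasing function of $d$, if $d < \kappa(s)$ then $\beta > s$. Thus,
$$I_d \le \begin{cases} \left(1 - \dfrac{2s}{5\alpha}\right) \dbinom{k}{d}, & \text{when } d < \kappa(s), \\[6pt] \dfrac{4}{5} \dbinom{k}{d}, & \text{when } d \ge \kappa(s). \end{cases} \tag{36}$$
2) For $d > \frac{k}{2}$ (or, $k - d < d$),
$$\beta = \left\lceil \frac{d + 1}{2(k - d) + 1} \right\rceil, \qquad \alpha = \frac{d + 1}{k - d}.$$
Proceeding as above, we get
$$I_d \le \begin{cases} \left(1 - \dfrac{2s}{5\alpha}\right) \dbinom{k}{d}, & \text{when } d > k - \kappa(s), \\[6pt] \dfrac{4}{5} \dbinom{k}{d}, & \text{when } d \le k - \kappa(s). \end{cases} \tag{37}$$
In the case $s > \frac{k}{2}$, we can obtain the bounds for $I_d$ simply by changing $s$ to $k - s$.

APPENDIX B
PROOF OF LEMMA 2

In this lemma, we prove an upper bound on $\binom{k}{s} e^{-n\Sigma_s}$, where
$$\Sigma_s = \frac{1}{5} \sum_{d=\lceil \kappa(s) \rceil}^{k - \lceil \kappa(s) \rceil} \Omega_d + \frac{2s}{5} \left(\sum_{d=1}^{\lceil \kappa(s) \rceil - 1} \frac{d\,\Omega_d}{k - d + 1} + \sum_{d=k - \lceil \kappa(s) \rceil + 1}^{k} \frac{(k - d)\,\Omega_d}{d + 1}\right). \tag{38}$$
For the Soliton distribution
$$\Omega_d = \begin{cases} \dfrac{1}{D} & \text{if } d = 1, \\[4pt] \dfrac{1}{d(d-1)} & \text{if } 2 \le d \le D, \\[4pt] 0 & \text{if } d > D \text{ or } d = 0, \end{cases}$$
we have the query difficulty
$$\log(D + 1) < \bar{d} = \frac{1}{D} + \sum_{d=2}^{D} \frac{1}{d - 1} = \sum_{d=1}^{D} \frac{1}{d} < \log D + 1.$$
For simplicity, here we assume that $D \ge 2$. Recall that
$$\kappa(s) = \frac{k - s + 1}{2s + 1}, \tag{39}$$
which is a decreasing function of $s$, and $\kappa(s) > 0$ for $s \le \frac{k}{2}$.

1) If $\lceil \kappa(s) \rceil > D$,
$$\Sigma_s \ge \frac{2s}{5} \sum_{d=1}^{\lceil \kappa(s) \rceil - 1} \frac{d\,\Omega_d}{k - d + 1} > \frac{2s}{5k} \sum_{d=1}^{D} d\,\Omega_d = \frac{2s\bar{d}}{5k}.$$
Thus, if $n\bar{d} \ge 5k \log k$,
$$\binom{k}{s} e^{-n\Sigma_s} < k^{s} \exp\left(-\frac{2ns\bar{d}}{5k}\right) \le k^{s} k^{-2s} = k^{-s}.$$

2) If $4 \le \lceil \kappa(s) \rceil \le D$, we first notice that
$$s < \frac{k - 2}{7} \iff \kappa(s) > 3 \iff \lceil \kappa(s) \rceil \ge 4. \tag{40}$$
Thus $s \le \frac{k - 2}{7}$ and $\kappa(s) - 1 = \frac{k - 3s}{2s + 1} \ge \frac{4k}{7(2s + 1)} \ge \frac{4k}{21s}$. In this case,
$$\Sigma_s \ge \frac{2s}{5} \sum_{d=1}^{\lceil \kappa(s) \rceil - 1} \frac{d\,\Omega_d}{k - d + 1} > \frac{2s}{5k} \sum_{d=2}^{\lceil \kappa(s) \rceil - 1} \frac{1}{d - 1} > \frac{2s}{5k} \log\left(\lceil \kappa(s) \rceil - 1\right) \ge \frac{2s}{5k} \log\left(\frac{4k}{21s}\right).$$
Moreover, since $\frac{k}{s} \ge 7$, if $n \ge Ck$ for some sufficiently large $C$ ($C \ge 68$ suffices),
$$n\Sigma_s \ge \frac{2Cs}{5} \log\left(\frac{4k}{21s}\right) \ge 4s \log\left(\frac{k}{s}\right) + \left(\frac{2C}{5} - 4\right) s \log 7 + \frac{2Cs}{5} \log\left(\frac{4}{21}\right) \ge 4s \log\left(\frac{k}{s}\right).$$
From Stirling's formula, we also have that
$$\sqrt{2\pi}\, n^{n + \frac{1}{2}} e^{-n} \le n! \le e\, n^{n + \frac{1}{2}} e^{-n},$$
hence
$$\binom{k}{s} \le \frac{e\, k^{k + \frac{1}{2}} e^{-k}}{2\pi\, (k - s)^{k - s + \frac{1}{2}} e^{-(k - s)}\, s^{s + \frac{1}{2}} e^{-s}} \le \frac{\sqrt{k}}{2\sqrt{(k - s)s}} \cdot \frac{k^{k}}{(k - s)^{k - s} s^{s}} \le \left(\frac{k}{s}\right)^{s} \left(1 - \frac{s}{k}\right)^{s - k} \le \left(\frac{k}{s}\right)^{s} e^{s - \frac{s^2}{k}} = \exp\left(s \log\left(\frac{k}{s}\right) + s - \frac{s^2}{k}\right) \le \exp\left(2s \log\left(\frac{k}{s}\right)\right).$$
Thus, if $n \ge 68k$,
$$\binom{k}{s} e^{-n\Sigma_s} \le \exp\left(-2s \log\left(\frac{k}{s}\right)\right) = \left(\frac{k}{s}\right)^{-2s}.$$
Note that
$$\left(\frac{k}{s}\right)^{-2s} \le \begin{cases} k^{-s} & \text{if } s \le \sqrt{k}, \\ 2^{-2\sqrt{k}} & \text{if } \sqrt{k} < s \le k/2. \end{cases}$$

3) If $\lceil \kappa(s) \rceil = 3$, we find from (40) that $s \ge \frac{k - 2}{7}$. Then, by considering the case $d = 2$,
$$\Sigma_s \ge \frac{2s}{5} \sum_{d=1}^{\lceil \kappa(s) \rceil - 1} \frac{d\,\Omega_d}{k - d + 1} \ge \frac{2s}{5(k - 1)} \ge \frac{2(k - 2)}{35(k - 1)} \ge \frac{1}{35}$$
for $k \ge 3$. Thus, if $n \ge 35k$,
$$\binom{k}{s} e^{-n\Sigma_s} \le 2^{k} e^{-k}.$$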
4) If $\lceil \kappa(s) \rceil = 1, 2$,
$$\Sigma_s \ge \frac{1}{5} \sum_{d=\lceil \kappa(s) \rceil}^{k-\lceil \kappa(s) \rceil} \Omega_d \ge \frac{\Omega_2}{5} = \frac{1}{10}.$$
Thus, if $n \ge 10k$,
$$\binom{k}{s} e^{-n\Sigma_s} \le 2^k e^{-k}.$$

REFERENCES

[1] D. R. Karger, S. Oh, and D. Shah, "Budget-optimal task allocation for reliable crowdsourcing systems," Operations Research, vol. 62, no. 1, pp. 1–24, 2014.
[2] M. S. Bernstein, J. Brandt, R. C. Miller, and D. R. Karger, "Crowds in two seconds: Enabling realtime crowd-powered interfaces," in Proceedings of the 24th Annual ACM Symposium on User Interface Software and Technology. ACM, 2011, pp. 33–42.
[3] D. J. MacKay, "Information-based objective functions for active data selection," Neural Computation, vol. 4, no. 4, pp. 590–604, 1992.
[4] B. Settles, "Active learning literature survey," University of Wisconsin, Madison, vol. 52, no. 55-66, p. 11, 2010.
[5] D. V. Lindley, "On a measure of the information provided by an experiment," The Annals of Mathematical Statistics, pp. 986–1005, 1956.
[6] V. V. Fedorov, Theory of Optimal Experiments. Elsevier, 1972.
[7] E. Abbe and C. Sandon, "Community detection in general stochastic block models: Fundamental limits and efficient algorithms for recovery," in Foundations of Computer Science (FOCS), 2015 IEEE 56th Annual Symposium on. IEEE, 2015, pp. 670–688.
[8] B. Hajek, Y. Wu, and J. Xu, "Information limits for recovering a hidden community," IEEE Transactions on Information Theory, 2017.
[9] H. W. Chung, B. M. Sadler, L. Zheng, and A. O. Hero, "Unequal error protection querying policies for the noisy 20 questions problem," IEEE Transactions on Information Theory, DOI: 10.1109/TIT.2017.2760634, 2017.
[10] T. Tsiligkaridis, B. M. Sadler, and A. O. Hero, "Collaborative 20 questions for target localization," IEEE Transactions on Information Theory, vol. 60, no. 4, pp. 2233–2252, 2014.
[11] D. J. MacKay, "Fountain codes," IEE Proceedings-Communications, vol. 152, no. 6, pp. 1062–1068, 2005.
[12] M. Luby, "LT codes," in Proceedings. The 43rd Annual IEEE Symposium on Foundations of Computer Science, 2002. IEEE, 2002, pp. 271–280.
[13] A. Shokrollahi, "Raptor codes," IEEE Transactions on Information Theory, vol. 52, no. 6, pp. 2551–2567, 2006.
[14] Y. Chen, C. Suh, and A. J. Goldsmith, "Information recovery from pairwise measurements," IEEE Transactions on Information Theory, vol. 62, no. 10, pp. 5881–5905, 2016.
[15] K. Ahn, K. Lee, and C. Suh, "Community recovery in hypergraphs," in 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, 2016, pp. 657–663.
