From Small-World Networks to Comparison-Based Search
Authors: Amin Karbasi, Stratis Ioannidis, Laurent Massoulié
From Small-World Networks to Comparison-Based Search

Amin Karbasi, Member, IEEE, Stratis Ioannidis, Member, IEEE, and Laurent Massoulié, Member, IEEE

Abstract: The problem of content search through comparisons has recently received considerable attention. In short, a user searching for a target object navigates through a database in the following manner: the user is asked to select the object most similar to her target from a small list of objects. A new object list is then presented to the user based on her earlier selection. This process is repeated until the target is included in the list presented, at which point the search terminates. This problem is known to be strongly related to the small-world network design problem. However, contrary to prior work, which focuses on cases where objects in the database are equally popular, we consider here the case where the demand for objects may be heterogeneous. We show that, under heterogeneous demand, the small-world network design problem is NP-hard. Given this negative result, we propose a novel mechanism for small-world design and provide an upper bound on its performance under heterogeneous demand. This mechanism has a natural equivalent in the context of content search through comparisons, and we establish both an upper bound and a lower bound for the performance of this mechanism. These bounds are intuitively appealing, as they depend on the entropy of the demand as well as its doubling constant, a quantity capturing the topology of the set of target objects. They also illustrate interesting connections between comparison-based search and classic results from information theory. Finally, we propose an adaptive learning algorithm for content search that meets the performance guarantees achieved by the above mechanisms.
Amin Karbasi is with the Department of Computer Science, ETH Zurich, Switzerland (email: amin.karbasi@inf.ethz.ch). Stratis Ioannidis is with Technicolor, Palo Alto, CA 94301, USA (email: stratis.ioannidis@technicolor.com). Laurent Massoulié is with the MSR-INRIA joint research center, École Polytechnique, Paris, France (email: laurent.massoulie@inria.fr).

November 15, 2018 DRAFT

I. INTRODUCTION

The problem we study in this paper is content search through comparisons. In short, a user searching for a target object navigates through a database in the following manner. The user is asked to select the object most similar to her target from a small list of objects. A new object list is then presented to the user based on her earlier selection. This process is repeated until the target is included in the list presented, at which point the search terminates. Searching through comparisons is a typical example of exploratory search [1], the need for which arises when users are unable to state and submit explicit queries to the database. Exploratory search has several important real-life applications. An often-cited example is navigating through a database of pictures of humans in which subjects are photographed under diverse, uncontrolled conditions [2], [3]. For example, the pictures may be taken outdoors, from different angles or distances, while the subjects assume different poses, are partially obscured, etc. Automated methods may fail to extract meaningful features from such photos, so the database cannot be queried in the traditional fashion. On the other hand, a human searching for a particular person can easily select from a list of pictures the subject most similar to the person she has in mind. Users may also be unable to state queries because, e.g., they are unfamiliar with the search domain, or do not have a clear target in mind.
For example, a novice classical music listener may not be able to express that she is, e.g., looking for a fugue or a sonata. She might however identify, among samples of different musical pieces, the one closest to the piece she has in mind. Alternatively, a user surfing the web may not know a priori which post she wishes to read; presenting a list of blog posts and letting the surfer identify which one she likes best can steer her in the right direction. In all the above applications, the problem of content search through comparisons amounts to determining which objects to present to the user in order to find the target object as quickly as possible. Formally, the behavior of a human user can be modeled by a so-called comparison oracle introduced by [4]: given a target and a choice between two objects, the oracle outputs the one closest to the target. The goal is thus to find a sequence of proposed pairs of objects that leads to the target object with as few oracle queries as possible. This problem was introduced by [4] and has recently received considerable attention (see, for example, [5], [3], [2]). Content search through comparisons is also naturally related to the following problem: given a graph embedded in a metric space, how should one augment this graph by adding edges in order to minimize the expected cost of greedy forwarding over this graph? This is known as the small-world network design problem (see, for example, [6], [7]) and has a variety of applications, e.g., in network routing. In this paper, we consider both problems under the scenario of heterogeneous demand. This scenario is well motivated in practice: objects in a database are indeed unlikely to be requested with the same frequency. Our contributions are as follows:

• We show that the small-world network design problem under general heterogeneous demand is NP-hard.
Given earlier work on this problem under homogeneous demand [7], [6], this result is interesting in its own right.

• We propose a novel mechanism for edge addition in the small-world design problem, and provide an upper bound on its performance.

• The above mechanism has a natural equivalent in the context of content search through comparisons, and we provide a matching upper bound for the performance of this mechanism.

• We also establish a lower bound on any mechanism solving the content search through comparisons problem.

• Finally, based on these results, we propose an adaptive learning algorithm for content search that, given access only to a comparison oracle, can meet the performance guarantees achieved by the above mechanisms.

To the best of our knowledge, we are the first to study the above two problems in a setting of heterogeneous demand. Our analysis is intuitively appealing because our upper and lower bounds relate the cost of content search to two important properties of the demand distribution, namely its entropy and its doubling constant. We thus provide performance guarantees in terms of the bias of the distribution of targets, captured by the entropy, as well as the topology of their embedding, captured by the doubling constant. The remainder of this paper is organized as follows. In Section II we provide an overview of the related work in this area. In Sections III and IV we introduce our notation and formally state the two problems that are the focus of this work, namely content search through comparisons and small-world network design. We present our main results in Section V and our adaptive learning algorithm in Section VI. Section VII is devoted to the proofs of our main theorems. We then address two extensions of our work in Section VIII and finally conclude in Section IX.

II. RELATED WORK

Content search through comparisons is a special case of nearest neighbour search (NNS), a problem that has been extensively studied [8], [9]. Our work can be seen as an extension of earlier work [10], [11], [8] considering the NNS problem for objects embedded in a metric space with a small intrinsic dimension. In particular, the authors of [11] introduce navigating nets, a deterministic data structure for supporting NNS in doubling metric spaces. A similar technique was considered by [8] for objects embedded in a space satisfying a certain sphere-packing property, while [10] relied on growth-restricted metrics; all of the above assumptions have connections to the doubling constant we consider in this paper. In all of these works, however, the underlying metric space is fully observable by the search mechanism while, in our work, we are restricted to accesses to a comparison oracle. Moreover, in all of the above works, the demand over the target objects is assumed to be homogeneous. NNS with access to a comparison oracle was first introduced by [4], and further explored by [5], [3], [2]. A considerable advantage of the above works is that the assumption that objects are a priori embedded in a metric space is removed; rather than requiring that similarity between objects be captured by a distance metric, the above works only assume that any two objects can be ranked in terms of their similarity to any target by the comparison oracle. To provide performance guarantees on the search cost, [4] introduced a so-called "disorder constant", capturing the degree to which object rankings violate the triangle inequality. This disorder constant plays roughly the same role in their analysis as the doubling constant does in ours.
Nevertheless, these works also assume homogeneous demand, so our work can be seen as an extension of searching with comparisons to heterogeneity, with the caveat of restricting our analysis to the case where a metric embedding exists. An additional important distinction between [4], [5], [3], [2] and our work is the existence of a learning phase, during which explicit questions are put to the comparison oracle. A data structure is constructed during this phase, which is subsequently used to answer queries submitted to the database during a "search" phase. The above works establish different tradeoffs between the length of the learning phase, the space complexity of the data structure created, and the cost incurred during searching. In contrast, the learning scheme we consider in Section VI is adaptive, and learning occurs while users search; the drawback is that our guarantees on the search cost are asymptotic. Again, the main advantage of our approach lies in dealing with heterogeneity. The use of interactive methods (i.e., methods that incorporate human feedback) for content search has a long history in the literature. Arguably, the first oracle considered to model such methods is the so-called membership oracle [12], which allows the search mechanism to ask a user questions of the form "does the target belong to set A?" (see also our discussion in Section III-D). [13] deploys such an interactive method for object classification and evaluates it on the Animals with Attributes database. A similar approach was used by [14], who formulated shape recognition as a coding problem and applied this approach to handwritten numerals and satellite images. Having access to a membership oracle, however, is a strong assumption, as humans may not necessarily be able to answer queries of the above type for any object set A.
Moreover, the large number of possible sets makes the cost of designing optimal querying strategies over large datasets prohibitive. In contrast, the comparison oracle model makes a far weaker assumption on human behavior (namely, the ability to compare different objects to the target) and significantly limits the design space, making search mechanisms using comparisons practical even over large datasets. The design of small-world networks (also called navigable networks) has received a lot of attention after the seminal work of [15]. Our work is most similar to [6], where a condition is identified under which graphs embedded in a doubling metric space can be made navigable. The same idea was explored in more general spaces by [7]. Again, the main difference in our approach to small-world network design lies in considering heterogeneous demand, an aspect of small-world networks not investigated in earlier work. The relationship between small-world network design and content search has also been observed in earlier work [4] and was exploited by [5] in proposing their data structures for content search through comparisons; we further expand on this issue in Section IV-C, as this is an approach we also follow.

III. DEFINITIONS AND NOTATION

In this section we introduce some definitions and notation which will be used throughout this paper.

A. Objects and Metric Embedding

Consider a set of objects N, where |N| = n. We assume that there exists a metric space (M, d), where d(x, y) denotes the distance between x, y ∈ M, such that the objects in N are embedded in (M, d): i.e., there exists a one-to-one mapping from N to a subset of M. The objects in N may represent, for example, pictures in a database. The metric embedding can be thought of as a mapping of the database entries to a set of features (e.g., the age of the person depicted, her hair and eye color, etc.). The distance between two objects would then capture how "similar" the two objects are w.r.t. these features. In what follows, we will abuse notation and write N ⊆ M, keeping in mind that there might be a difference between the physical objects (the pictures) and their embedding (the attributes that characterize them). Given an object z ∈ N, we can order objects according to their distance from z. We will write x ≼_z y if d(x, z) ≤ d(y, z). Moreover, we will write x ∼_z y if d(x, z) = d(y, z), and x ≺_z y if x ≼_z y but not x ∼_z y. Note that ∼_z is an equivalence relation, and hence partitions N into equivalence classes. Moreover, ≼_z defines a total order over these equivalence classes, with respect to their distance from z. Given a non-empty set A ⊆ N, we denote by min_{≼_z} A the object in A closest to z, i.e., min_{≼_z} A = w ∈ A s.t. w ≼_z v for all v ∈ A.

B. Comparison Oracle

A comparison oracle [4] is an oracle that, given two objects x, y and a target t, returns the object closest to t. More formally,

Oracle(x, y, t) = x if x ≺_t y;  y if x ≻_t y;  x or y if x ∼_t y.    (1)

Observe that if x = Oracle(x, y, t) then x ≼_t y; this does not necessarily imply, however, that x ≺_t y. This oracle aims to capture the behavior of human users. A human interested in locating, e.g., a target picture t within the database may be able to compare other pictures with respect to their similarity to this target, but cannot associate a numerical value with this similarity. Moreover, when the pair of pictures compared are equally similar to the target, the decision made by the human may be arbitrary.
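As a minimal sketch, the oracle of (1) can be implemented as follows, assuming objects are points in a metric space with a computable distance (Euclidean here, though any metric works); ties are broken arbitrarily, mirroring the arbitrary human decision:

```python
import math
import random

def d(x, y):
    """Distance in (M, d); here Euclidean, an illustrative choice."""
    return math.dist(x, y)

def oracle(x, y, t):
    """Comparison oracle of Eq. (1): return the argument closest to the
    target t. When x ~_t y, the answer is arbitrary."""
    if d(x, t) < d(y, t):
        return x
    if d(x, t) > d(y, t):
        return y
    return random.choice([x, y])  # equidistant: arbitrary decision
```

Note that t appears as an argument only for notational convenience: a search policy may never inspect it directly, in keeping with the "oracle as human" analogy below.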
It is important to note here that although we write Oracle(x, y, t) to stress that a query always takes place with respect to some target t, in practice the target is hidden and only known by the oracle. Alternatively, following the "oracle as human" analogy, the human user has a target in mind and uses it to compare the two objects, but never discloses it until actually being presented with it. Note that our oracle is weaker than one that correctly identifies the relationship x ∼_t y and, e.g., returns a special character "=" once two such objects are proposed: to see this, observe that oracle (1) can be implemented using this stronger oracle. Hence, all our results hold if we are provided with such an oracle instead.

C. Demand

We denote by N × N the set of all ordered pairs of objects in N. For (s, t) ∈ N × N, we will call s the source and t the target of the ordered pair. We will consider a probability distribution λ over all ordered pairs of objects in N, which we will call the demand. In other words, λ will be a non-negative function such that

Σ_{(s,t) ∈ N×N} λ(s, t) = 1.

In general, the demand can be heterogeneous, as λ(s, t) may vary across different sources and targets. We refer to the marginal distributions

ν(s) = Σ_t λ(s, t),    μ(t) = Σ_s λ(s, t),

as the source and target distributions, respectively. Moreover, we will refer to the support of the target distribution,

T = supp(μ) = {x ∈ N : μ(x) > 0},

as the target set of the demand. As we will see in Section V, the target distribution μ will play an important role in our analysis. In particular, two quantities that affect the performance of searching in our scheme are the entropy and the doubling constant of the target distribution. We introduce these two notions formally below.

D. Entropy

Let σ be a probability distribution over N.
The entropy of σ is defined as

H(σ) = Σ_{x ∈ supp(σ)} σ(x) log(1/σ(x)).    (2)

We define the max-entropy of σ as

H_max(σ) = max_{x ∈ supp(σ)} log(1/σ(x)).    (3)

The entropy has strong connections with the content search problem. More specifically, suppose that we have access to a so-called membership oracle [16] that can answer queries of the following form: "Given a target t and a subset A ⊆ N, does t belong to A?" Assume now that an object t is selected according to a distribution μ. It is well known that to find a target t one needs to submit at least H(μ) queries, on average, to the oracle described above (see chap. 2 of [16]). Moreover, there exists an algorithm (Huffman coding) that finds the target with only H(μ) + 1 queries on average [16]. In the worst case, which occurs when the target is the least frequently selected object, the algorithm requires H_max(μ) + 1 queries to identify t. Our work identifies similar bounds assuming that one only has access to a comparison oracle, like the one described by (1). Not surprisingly, the entropy of the target distribution H(μ) shows up in the performance bounds that we obtain (Theorems 3 and 4). However, searching for an object will depend not only on the entropy of the target distribution, but also on the topology of the target set T. This will be captured by the doubling constant of μ, which we describe in more detail below.

E. Doubling Constant

Given an object x ∈ N, we denote by

B_x(r) = {y ∈ M : d(x, y) ≤ r}    (4)

the closed ball of radius r ≥ 0 around x. Given a probability distribution σ over N and a set A ⊆ N, let σ(A) = Σ_{x ∈ A} σ(x).

Fig. 1. Example of the dependence of c(σ) on the topology of the support supp(σ). When supp(σ) consists of n = 64 objects arranged in a cube, c(σ) = 2^3. If, on the other hand, these n objects are placed on a plane, c(σ) = 2^2.
In both cases σ is assumed to be uniform, and H(σ) = log n.

TABLE I: SUMMARY OF NOTATION

N           Set of objects
(M, d)      Metric space
d(x, y)     Distance between x, y ∈ M
x ≼_z y     Ordering w.r.t. distance from z
x ∼_z y     x and y at the same distance from z
λ           The demand distribution
ν           The source distribution
μ           The target distribution
T           The target set
H(σ)        The entropy of σ
H_max(σ)    The max-entropy of σ
B_x(r)      The ball of radius r centered at x
c(σ)        The doubling constant of σ
S           The set of shortcut edges
L           The set of local edges
C̄_S         Expected cost of greedy forwarding given set S
C̄_F         Expected search cost of policy F

We define the doubling constant c(σ) of a distribution σ to be the minimum c > 0 for which

σ(B_x(2r)) ≤ c · σ(B_x(r)),    (5)

for any x ∈ supp(σ) and any r ≥ 0. Moreover, we will say that σ is c-doubling if c(σ) = c. Note that, contrary to the entropy H(σ), the doubling constant c(σ) depends on the topology of supp(σ), determined by the embedding of N in the metric space (M, d). This is illustrated in Fig. 1. In this example, |N| = 64, and the set N is embedded in a 3-dimensional cube. Assume that σ is the uniform distribution over the n objects; if these objects are arranged uniformly in a cube, then c(σ) = 2^3; if however these n objects are arranged uniformly in a 2-dimensional plane, c(σ) = 2^2. Note that, in contrast, the entropy of σ in both cases equals log n (and so does the max-entropy).

IV. PROBLEM STATEMENT

We now formally define the two problems that will be the main focus of this paper. The first is the problem of content search through comparisons and the second is the small-world network design problem.

A. Content Search Through Comparisons

For the content search problem, we consider the object set N, embedded in (M, d).
Although this embedding exists, we are constrained by not being able to directly compute object distances. Instead, we only have access to a comparison oracle, like the one defined in Section III-B. Given access to the above oracle, we would like to navigate through N until we find a target object. In particular, we define greedy content search as follows. Let t be the target object and s some object that serves as a starting point. The greedy content search algorithm proposes an object w and asks the oracle to select, between s and w, the object closest to the target t, i.e., it invokes Oracle(s, w, t). This process is repeated until the oracle returns something other than s, i.e., until the proposed object is "more similar" to the target t. Once this happens, say at the proposal of some w′, if w′ ≠ t, greedy content search repeats the same process, now from w′. If at any point the proposed object is t, the process terminates. Recall that in the "oracle as a human" analogy the human cannot reveal t before actually being presented with it. We similarly assume here that t is never "revealed" before actually being presented to the oracle. Though we write Oracle(x, y, t) to stress that the submitted query is w.r.t. proximity to t, the target t is not a priori known. In particular, as we see below, the decision of which objects x and y to present to the oracle cannot directly depend on t. More formally, let (x_k, y_k) be the k-th pair of objects submitted to the oracle: x_k is the current object, which greedy content search is trying to improve upon, and y_k is the proposed object, submitted to the oracle for comparison with x_k. Let o_k = Oracle(x_k, y_k, t) ∈ {x_k, y_k} be the oracle's response, and define

H_k = {(x_i, y_i, o_i)}_{i=1}^{k},    k = 1, 2, . . .
be the sequence of the first k inputs given to the oracle, as well as the responses obtained; H_k is the "history" of the content search up to and including the k-th access to the oracle. The source object is always one of the first two objects submitted to the oracle, i.e., x_1 = s. Moreover, in greedy content search,

x_{k+1} = o_k,    k = 1, 2, . . . ,

i.e., the current object is always the closest to the target among the ones submitted so far. On the other hand, the selection of the proposed object y_{k+1} will be determined by the history H_k and the object x_k. In particular, given H_k and the current object x_k, there exists a mapping (H_k, x_k) ↦ F(H_k, x_k) ∈ N such that

y_{k+1} = F(H_k, x_k),    k = 0, 1, . . . ,

where we take x_0 = s ∈ N (the source/starting object) and H_0 = ∅ (i.e., before any comparison takes place, there is no history). We will call the mapping F the selection policy of the greedy content search. In general, we will allow the selection policy to be randomized; in this case, the object returned by F(H_k, x_k) will be a random variable, whose distribution

Pr(F(H_k, x_k) = w),    w ∈ N,    (6)

is fully determined by (H_k, x_k). Observe that F depends on the target t only indirectly, through H_k and x_k; this is consistent with our assumption that t is only "revealed" when it is eventually located. We will say that a selection policy is memoryless if it depends on x_k but not on the history H_k. In other words, the distribution (6) is the same whenever x_k = x ∈ N, irrespective of the comparisons performed prior to reaching x_k. Our goal is to select F so as to minimize the number of accesses to the oracle. In particular, given a source object s, a target t and a selection policy F, we define the search cost

C_F(s, t) = inf{k : y_k = t}

to be the number of proposals to the oracle until t is found.
This is a random variable, as F is randomized; let E[C_F(s, t)] be its expectation. The Content Search Through Comparisons problem is then defined as follows:

CONTENT SEARCH THROUGH COMPARISONS (CSTC): Given an embedding of N into (M, d) and a demand distribution λ(s, t), select F so as to minimize the expected search cost

C̄_F = Σ_{(s,t) ∈ N×N} λ(s, t) E[C_F(s, t)].

Note that, as F is randomized, the free variable in the above optimization problem is the distribution (6).

B. Small-World Network Design

In the small-world network design problem, we again consider the objects in N, embedded in (M, d). It is now assumed, however, that the objects in N are connected to each other. The network formed by such connections is represented by a directed graph G(N, L ∪ S), where L is the set of local edges and S is the set of shortcut edges. These edge sets are disjoint, i.e., L ∩ S = ∅. The edges in L are typically assumed to satisfy the following property:

Property 1: For every pair of distinct objects x, t ∈ N there exists an object u adjacent to x such that (x, u) ∈ L and u ≺_t x.

In other words, for any object x and a target t, x has a local edge leading to an object closer to t. Recall that in the content search problem the goal was to find t (starting from source s) using only accesses to a comparison oracle. Here the goal is to use such an oracle to route a message from s to t over the links in graph G. In particular, given graph G, we define greedy forwarding [15] over G as follows. Let Γ(s) be the neighborhood of s, i.e., Γ(s) = {u ∈ N s.t. (s, u) ∈ L ∪ S}. Given a source s and a target t, greedy forwarding sends the message to the neighbor w of s that is as close to t as possible, i.e.,

w = min_{≼_t} Γ(s).    (7)

If w ≠ t, the above process is repeated at w; if w = t, greedy forwarding terminates.
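As a sketch, greedy forwarding over G can be simulated as follows; the toy graph, the line metric, and the single shortcut edge are illustrative assumptions, not part of the formal model. The min in (7) needs only pairwise comparisons with respect to t, so it can be computed with |Γ(x)| accesses to a comparison oracle.

```python
def greedy_forwarding(neighbors, dist, s, t, max_hops=10_000):
    """Route a message from s to t over G: at each step move to the
    neighbor closest to t, as in Eq. (7). Returns the hop sequence."""
    path, x = [s], s
    while x != t and len(path) <= max_hops:
        x = min(neighbors[x], key=lambda u: dist(u, t))  # w = min_{<=_t} Gamma(x)
        path.append(x)
    return path

# Toy instance: ten objects on a line; local edges to immediate
# neighbors (these satisfy Property 1), plus one shortcut edge 0 -> 7.
objs = list(range(10))
nbrs = {i: {j for j in (i - 1, i + 1) if 0 <= j < 10} for i in objs}
nbrs[0].add(7)
dist = lambda u, v: abs(u - v)
# greedy_forwarding(nbrs, dist, 0, 9) -> [0, 7, 8, 9]
```

The shortcut saves hops exactly when it lands closer to the target than any local neighbor, which is the effect the shortcut-edge distributions studied below are designed to maximize.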
Note that local edges, through Property 1, guarantee that greedy forwarding from any source s will eventually reach t: there will always be a neighbor that is closer to t than the object currently holding the message. Moreover, the closest neighbor w selected through (7) can be found using a comparison oracle. In particular, if the message is at an object x, |Γ(x)| queries to the oracle suffice to find the neighbor that is closest to the target. The edges in L are typically called "local" because they are usually determined by object proximity. For example, in the classical paper by Kleinberg [15], objects are arranged uniformly in a rectangular k-dimensional grid (with no gaps) and d is taken to be the Manhattan distance on the grid. Moreover, there exists an r ≥ 1 such that any two objects at distance less than r have an edge in L. In other words,

L = {(x, y) ∈ N × N s.t. d(x, y) ≤ r}.    (8)

Assuming every position in the rectangular grid is occupied, such edges indeed satisfy Property 1. In this work, we will not require that the edges in L be given by (8) or some other locality-based definition; our only assumption is that they satisfy Property 1. Nevertheless, for the sake of consistency with prior work, we also refer to edges in L as "local". The shortcut edges S need not satisfy Property 1; our goal is to select these shortcut edges so that greedy forwarding is as efficient as possible. In particular, we assume that we can select no more than β shortcut edges, where β is a positive integer. For S a subset of N × N such that |S| ≤ β, we denote by C_S(s, t) the cost of greedy forwarding, in message hops, for forwarding a message from s to t given that S = S. We allow the selection of shortcut edges to be random: the set S can be a random variable over all subsets S of N × N such that |S| ≤ β. We denote by

Pr(S = S), S ⊆ N × N s.t.
|S| ≤ β,    (9)

the distribution of S. Given a source s and a target t, let

E[C_S(s, t)] = Σ_{S ⊆ N×N : |S| ≤ β} C_S(s, t) · Pr(S = S)

be the expected cost of forwarding a message from s to t with greedy forwarding, in message hops. We consider again a heterogeneous demand: a source and target object are selected at random from N × N according to a demand probability distribution λ. The small-world network design problem can then be formulated as follows.

SMALL-WORLD NETWORK DESIGN (SWND): Given an embedding of N into (M, d), a set of local edges L, a demand distribution λ, and an integer β > 0, select an r.v. S ⊆ N × N that minimizes

C̄_S = Σ_{(s,t) ∈ N×N} λ(s, t) E[C_S(s, t)]

subject to |S| ≤ β.

In other words, we wish to select S so that the cost of greedy forwarding is minimized. Note that, since S is a random variable, the free variable of the above optimization problem is essentially the distribution of S, given by (9).

C. Relationship Between SWND and CSTC

In what follows, we try to give some intuition about how SWND and CSTC are related, and why the upper bounds we obtain for these two problems are identical, without resorting to the technical details appearing in our proofs. Consider the following version of the SWND problem, in which we place three additional restrictions on the selection of the shortcut edges. First, |S| = n, i.e., we can only select n = |N| shortcut edges. Second, for every x ∈ N, there exists exactly one directed edge (x, y) ∈ S: each object has exactly one outgoing edge incident to it. Third, the object y to which object x connects is selected independently at each x, according to a probability distribution ℓ_x(y). In other words, for N = {x_1, x_2, . . . , x_n}, the joint distribution of shortcut edges has the form:

Pr(S = {(x_1, y_1), . . . , (x_n, y_n)}) = Π_{i=1}^{n} ℓ_{x_i}(y_i).    (10)

We call this version of the SWND problem the one-edge-per-object version, and denote it by 1-SWND. Note that, in 1-SWND, the free variables are the distributions ℓ_x, x ∈ N, which are to be selected in order to minimize the average cost C̄_S. Consider now the following content selection policy for CSTC:

Pr(F(x_k) = w) = ℓ_{x_k}(w), for all w ∈ N.

In other words, the proposed object at x_k is sampled according to the same distribution as the shortcut edge in 1-SWND. This selection policy is memoryless, as it does not depend on the history H_k of objects presented to the oracle so far.

Fig. 2. An illustration of the relationship between 1-SWND and CSTC. In CSTC, the source s samples objects independently from the same distribution until it locates an object closer to the target t. In 1-SWND, the re-sampling is emulated by the movement to new neighbors. Each neighbor "samples" a new object independently, from a slightly perturbed distribution, until one closer to the target t is found.

A parallel between these two problems can be drawn as follows. Suppose that the same source/target pair (s, t) is given in both problems. In content search, starting from node s, the memoryless selection policy draws independent samples from distribution ℓ_s until an object closer to the target than s is found. In contrast, greedy forwarding in 1-SWND can be described as follows. Since shortcut edges are generated independently, we can assume that they are generated while the message is being forwarded. Then, greedy forwarding at the source object can be seen as sampling an object from distribution ℓ_s, namely, the one incident to its shortcut edge. If this object is not closer to the target than s, the message is forwarded to a neighboring node s_1 over a local edge of s.
Node s_1 then samples independently a node from distribution ℓ_{s_1} this time, namely, the one incident to its shortcut edge. Suppose that the distributions ℓ_x vary only slightly across neighboring nodes. Then, forwarding over local edges corresponds to the independent re-sampling occurring in the content search problem. Each move to a new neighbor samples a new object (the one incident to its shortcut edge) independently of previous objects, but from a slightly perturbed distribution. This is repeated until an object closer to the target t is found, at which point the message moves to a new neighborhood over the shortcut edge. Effectively, re-sampling is "emulated" in 1-SWND by the movement to new neighbors. This is, of course, an informal argument; we refer the interested reader to the proofs of Theorems 2 and 3 for a rigorous statement of the relationship between the two problems.

V. MAIN RESULTS

We now present our main results with respect to SWND and CSTC. Our first result is negative: optimizing greedy forwarding is a hard problem.

Theorem 1: SWND is NP-hard.

The proof of this theorem can be found in Section VII-A. In short, the proof reduces DOMINATING SET to the decision version of SWND. Interestingly, the reduction is to a SWND instance in which (a) the metric space is a 2-dimensional grid, (b) the distance metric is the Manhattan distance on the grid, and (c) the local edges are given by (8). Thus, SWND remains NP-hard even in the original setup considered by Kleinberg [15].

The NP-hardness of SWND suggests that this problem cannot be solved in its full generality. Motivated by this, as well as by its relationship to content search through comparisons, we consider below the restricted version 1-SWND. In particular, we provide a distribution of edges for 1-SWND for which an upper bound on the search cost exists.
This upper bound can be expressed in terms of the entropy and the doubling dimension of the target distribution μ. Through the relationship of 1-SWND with CSTC, we are able to obtain a greedy content search strategy whose cost can also be bounded the same way.

For a given demand λ, recall that μ is the marginal distribution of the demand λ over the target set T, and that for A ⊂ N, μ(A) = Σ_{x ∈ A} μ(x). Then, for any two objects x, y ∈ N, we define the rank of object y w.r.t. object x as follows:

r_x(y) ≡ μ(B_x(d(x, y))), (11)

where B_x(r) is the closed ball with radius r centered at x. Suppose now that shortcut edges are generated according to the joint distribution (10), where the outgoing link from an object x ∈ N is selected according to the following probability:

ℓ_x(y) ∝ μ(y) / r_x(y), (12)

for y ∈ supp(μ), while for y ∉ supp(μ) we define ℓ_x(y) to be zero. Eq. (12) implies the following appealing properties.

• For two objects y, z that have the same distance from x, if μ(y) > μ(z) then ℓ_x(y) > ℓ_x(z), i.e., y has a higher probability of being connected to x.
• When two objects y, z are equally likely to be targets, if y ≺_x z then ℓ_x(y) > ℓ_x(z).

Algorithm 1 Memoryless Content Search
Require: Oracle(·, ·, t), demand distribution μ, starting object s.
Ensure: target t.
1: x ← s
2: while x ≠ t do
3:   Sample y ∈ N from the probability distribution Pr(F(H_k, x_k) = y) = ℓ_{x_k}(y).
4:   x ← Oracle(x, y, t).
5: end while

The distribution (12) thus biases both towards objects close to x and towards objects that are likely to be targets. Finally, if the metric space (M, d) is a k-dimensional grid and the targets are uniformly distributed over N, then ℓ_x(y) ∝ (d(x, y))^{−k}.
This is the shortcut distribution used by [15]; Eq. (12) is thus a generalization of this distribution to heterogeneous targets as well as to more general metric spaces.

Our next theorem, whose proof is in Section VII-B, relates the cost of greedy forwarding under (12) to the entropy H, the max-entropy H_max and the doubling parameter c of the target distribution μ.

Theorem 2: Given a demand λ, consider the set of shortcut edges S sampled according to (10), where ℓ_x(y), x, y ∈ N, are given by (12). Then

C̄_S ≤ 6 c³(μ) · H(μ) · H_max(μ).

Note that the bound in Theorem 2 depends on λ only through the target distribution μ. In particular, it holds for any source distribution ν, and does not require that sources are selected independently of the targets t. Moreover, if N is a k-dimensional grid and μ is the uniform distribution over N, the above bound becomes O(log² n), thus retrieving the result of [15].

Exploiting an underlying relationship between 1-SWND and CSTC, we can obtain an efficient selection policy for greedy content search. In particular,

Theorem 3: Given a demand λ, consider the memoryless selection policy F outlined in Algorithm 1. Then

C̄_F ≤ 6 c³(μ) · H(μ) · H_max(μ).

The proof of this theorem is given in Section VII-C. Like Theorem 2, Theorem 3 characterises the search cost in terms of the doubling constant, the entropy and the max-entropy of μ. This is very appealing, given (a) the relationship between c(μ) and the topology of the target set and (b) the classic result regarding the entropy and accesses to a membership oracle, as outlined in Section III. The distributions ℓ_x are defined in terms of the embedding of N in (M, d) and the target distribution μ. Interestingly, however, the bounds of Theorem 3 can be achieved even if neither the embedding in (M, d) nor the target distribution μ is a priori known.
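As a concrete illustration (not part of the paper), the rank (11), the shortcut distribution (12), and Algorithm 1 can be sketched in a few lines of Python over a finite object set. The object set, the distance function, and the use of true distances to emulate the comparison oracle are all assumptions made for this sketch:

```python
import random

def rank(x, y, objects, mu, dist):
    """r_x(y) = mu(B_x(d(x, y))): total target mass within distance d(x, y) of x."""
    radius = dist(x, y)
    return sum(mu[z] for z in objects if dist(x, z) <= radius)

def shortcut_distribution(x, objects, mu, dist):
    """l_x(y) proportional to mu(y) / r_x(y), per Eq. (12); zero outside supp(mu)."""
    weights = {y: mu[y] / rank(x, y, objects, mu, dist)
               for y in objects if y != x and mu[y] > 0}
    total = sum(weights.values())
    return {y: w / total for y, w in weights.items()}

def memoryless_search(s, t, objects, mu, dist):
    """Algorithm 1: repeatedly propose y sampled from l_x and keep whichever of
    x, y is closer to the target t (here the oracle is emulated with dist)."""
    x, hops = s, 0
    while x != t:
        probs = shortcut_distribution(x, objects, mu, dist)
        ys, ws = zip(*probs.items())
        y = random.choices(ys, weights=ws)[0]
        x = y if dist(y, t) < dist(x, t) else x  # oracle keeps the closer object
        hops += 1
    return hops
```

For example, on a line of five equally popular objects with dist(a, b) = |a − b|, an object at distance 1 from x receives a strictly larger shortcut probability than one at distance 4, reflecting the bias of (12) towards nearby objects.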
In our technical report [17] we propose an adaptive algorithm that asymptotically achieves the performance guarantees of Theorem 3 only through access to a comparison oracle. In short, the algorithm learns the ranks r_x(y) and the target distribution μ as searches through comparisons take place.

A question arising from Theorems 2 and 3 is how tight these bounds are. Intuitively, we expect that the optimal shortcut set S and the optimal selection policy F depend both on the entropy of the target distribution and on its doubling constant. Our next theorem, whose proof is in Section VII-D, establishes that this is the case for F.

Theorem 4: For any integers K and D, there exists a metric space (M, d) and a target measure μ with entropy H(μ) = K log(D) and doubling constant c(μ) = D such that the average search cost of any selection policy F satisfies

C̄_F ≥ H(μ) · (c(μ) − 1) / (2 log(c(μ))). (13)

Hence, the bound in Theorem 3 is tight within a c²(μ) log(c(μ)) H_max factor.

VI. LEARNING ALGORITHM

Section V established bounds on the cost of greedy content search provided that the distribution (12) is used to propose items to the oracle. Hence, if the embedding of N in (M, d) and the target distribution μ are known, it is possible to perform greedy content search with the performance guarantees provided by Theorem 3.

In this section, we turn our attention to how such bounds can be achieved if neither the embedding in (M, d) nor the target distribution μ is a priori known. To this end, we propose a novel adaptive algorithm that achieves the performance guarantees of Theorem 3 without access to the above information. Our algorithm effectively learns the ranks r_x(y) of objects and the target distribution μ as time progresses.
It does not require that distances between objects are at any point disclosed; instead, we assume that it only has access to a comparison oracle, slightly stronger than the one described in Section IV-B. It is important to note that our algorithm is adaptive: though we prove its convergence under a stationary regime, the algorithm can operate in a dynamic environment. For example, new objects can be added to the database while old ones can be removed. Moreover, the popularity of objects can change as time progresses. Provided that such changes happen infrequently, at a larger timescale compared to the timescale in which database queries are submitted, our algorithm will be able to adapt and converge to the desired behavior.

A. Demand Model and Probabilistic Oracle

We assume that time is slotted and that at each timeslot τ = 0, 1, … a new query is generated in the database. As before, we assume that the source and target of the new query are selected according to a demand distribution λ over N × N. We again denote by ν, μ the (marginal) source and target distributions, respectively. Our algorithm will require that the support of both the source and target distributions is N, and more precisely that

λ(x, y) > 0, for all x, y ∈ N. (14)

The requirement that the target set T = supp(μ) is N is necessary to ensure learning; we can only infer the relative order w.r.t. objects t for which questions of the form Oracle(x, y, t) are submitted to the oracle. Moreover, it is natural in our model to assume that the source distribution ν is at the discretion of our algorithm: we can choose which objects to propose first to the user/oracle. In this sense, for a given target distribution μ s.t. supp(μ) = N, (14) can be enforced, e.g., by selecting source objects uniformly at random from N and independently of the target.
We consider a slightly stronger oracle than the one described in Section IV-A. In particular, we again assume that

Oracle(x, y, t) = x if x ≺_t y, and y if x ≻_t y. (15)

However, we further assume that if x ∼_t y, then Oracle(x, y, t) can return either of the two possible outcomes with non-zero probability. This is stronger than the oracle in Section IV-A, where we assumed that the outcome will be arbitrary. We should point out here that this is still weaker than an oracle that correctly identifies x ∼_t y (i.e., the human states that these objects are at equal distance from t) as, given such an oracle, we can implement the above probabilistic oracle by simply returning x or y with equal probability.

B. Data Structures

For every object x ∈ N, the database storing x also maintains the following associated data structures. The first data structure is a counter keeping track of how often the object x has been requested so far. The second data structure maintains an order of the objects in N; at any point in time, this total order is an "estimator" of ≼_x, the order of objects with respect to their distance from x. We describe each one of these two data structures in more detail below.

a) Estimating the Target Distribution: The first data structure associated with an object x is an estimator of μ(x), i.e., the probability with which x is selected as a target. A simple method for keeping track of this information is through a counter C_x. This counter C_x is initially set to zero and is incremented every time object x is the target. If C_x(τ) is the counter at timeslot τ, then

μ̂(x) = C_x(τ) / τ (16)

is an unbiased estimator of μ(x). To avoid counting to infinity, a "moving average" (e.g., an exponentially weighted moving average) could be used instead.
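As a minimal sketch (not from the paper), the counter-based estimator (16) and the exponentially weighted moving-average alternative mentioned above can be implemented as follows; the class name and the smoothing weight `alpha` are hypothetical choices:

```python
class TargetEstimator:
    """Counter-based estimator of the target distribution mu, per Eq. (16).

    mu_hat(x) = C_x(tau) / tau after tau timeslots. An exponentially weighted
    moving average is kept alongside it, as a variant that avoids counting to
    infinity and can track slowly changing popularity."""

    def __init__(self, objects, alpha=0.01):
        self.counts = {x: 0 for x in objects}   # C_x
        self.ewma = {x: 0.0 for x in objects}   # moving-average variant
        self.tau = 0                            # timeslots observed so far
        self.alpha = alpha                      # EWMA weight (assumed value)

    def record_target(self, t):
        """Called once per timeslot, when object t is revealed as the target."""
        self.tau += 1
        self.counts[t] += 1
        for x in self.ewma:
            hit = 1.0 if x == t else 0.0
            self.ewma[x] = (1 - self.alpha) * self.ewma[x] + self.alpha * hit

    def mu_hat(self, x):
        """Unbiased estimator mu_hat(x) = C_x(tau) / tau."""
        return self.counts[x] / self.tau if self.tau else 0.0
```

For instance, after observing a target sequence in which object a appears three times out of four, mu_hat(a) equals 0.75, while the EWMA estimate approaches the same value at a rate controlled by alpha.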
b) Maintaining a Partial Order: The second data structure O_x associated with each x ∈ N maintains a total order of objects in N w.r.t. their similarity to x. It supports an operation called order() that returns a partition of objects in N along with a total order over this partition. In particular, the output of O_x.order() consists of an ordered sequence of disjoint sets A_1, A_2, …, A_j, where ∪ A_i = N \ {x}. Intuitively, any two objects in a set A_i are considered to be at equal distance from x, while among two objects u ∈ A_i and v ∈ A_j with i < j, the object u is assumed to be the closer to x. Moreover, every time the algorithm invokes Oracle(u, v, x) and learns, e.g., that u ≼_x v, the data structure O_x should be updated to reflect this information. In particular, if the algorithm has learned so far the order relationships

u_1 ≼_x v_1, u_2 ≼_x v_2, …, u_i ≼_x v_i, (17)

O_x.order() should return the objects in N sorted in such a way that all relationships in (17) are respected. In particular, object u_1 should appear before v_1, u_2 before v_2, and so forth. To that effect, the data structure should also support an operation called O_x.add(u, v) that adds the order relationship u ≼_x v to the constraints respected by the output of O_x.order().

A simple (but not the most efficient) way of implementing this data structure is to represent order relationships through a directed acyclic graph. Initially, the graph's vertex set is N and its edge set is empty. Every time an operation add(u, v) is executed, an edge is added between vertices u and v. If the addition of the new edge creates a cycle, then all nodes in the cycle are collapsed to a single node, thus keeping the graph acyclic. Note that the creation of a cycle u → v → … → w → u implies that u ∼_x v ∼_x … ∼_x w, i.e.
, all these nodes are at equal distance from x. Cycles can be detected by using depth-first search over the DAG [18]. The sets A_i returned by order() are the sets associated with each collapsed node, while a total order among them that respects the constraints implied by the edges in the DAG can be obtained either by depth-first search or by a topological sort [18]. Hence, the add() and order() operations have a worst-case cost of Θ(n + m), where m is the total number of edges in the graph. Several more efficient algorithms exist in the literature (see, for example, [19], [20], [21]), with the best of these, proposed by [21], yielding a cost of O(n) for order() and an aggregate cost of at most O(n² log n) for any sequence of add() operations. We stress here that any of these more efficient implementations could be used for our purposes. We refer the reader interested in such implementations to [19], [20], [21] and, to avoid any ambiguity, we assume the above naïve approach for the remainder of this work.

C. Greedy Content Search

Our learning algorithm implements greedy content search, as described in Section IV-A, in the following manner. When a new query is submitted to the database, the algorithm first selects a source s uniformly at random. It then performs greedy content search using a memoryless selection policy F̂ with distribution ℓ̂_x, i.e.,

Pr(F̂(H_k, x_k) = w) = ℓ̂_{x_k}(w), w ∈ N. (18)

Below, we discuss in detail how ℓ̂_x, x ∈ N, are computed. When the current object x_k, k = 0, 1, …, is equal to x, the algorithm invokes O_{x_k}.order() and obtains an ordered partition A_1, A_2, …, A_j of items in N \ {x}. We define

r̂_x(w) = Σ_{j=1}^{i : w ∈ A_i} μ̂(A_j), w ∈ N \ {x}.

This can be seen as an "estimator" of the true rank r_x given by (11).
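The naïve DAG implementation of O_x described above, together with the estimated rank r̂_x just defined, can be sketched as follows. This is an illustrative Python sketch under our own naming; it matches the Θ(n + m) naïve approach rather than the faster structures of [19]–[21]:

```python
from collections import deque

class OrderDAG:
    """Naive DAG-based order structure O_x.

    add(u, v) records the constraint u <=_x v; if the new edge closes a
    cycle, all nodes on the cycle are collapsed into one set (objects at
    equal distance from x). order() returns the collapsed sets A_1, ..., A_j
    in a total order consistent with every recorded constraint."""

    def __init__(self, objects):
        self.group = {u: frozenset([u]) for u in objects}  # object -> its set
        self.edges = set()                                 # edges between sets

    def _reach(self, start, reverse=False):
        # All groups reachable from `start` (along reversed edges if asked).
        seen, stack = {start}, [start]
        while stack:
            g = stack.pop()
            for (a, b) in self.edges:
                src, nxt = (b, a) if reverse else (a, b)
                if src == g and nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    def add(self, u, v):
        """Record u <=_x v, collapsing a cycle if one is created."""
        gu, gv = self.group[u], self.group[v]
        if gu == gv:
            return
        if gu in self._reach(gv):
            # Edge (gu, gv) would close a cycle: collapse every group that is
            # both reachable from gv and co-reachable to gu.
            cycle = self._reach(gv) & self._reach(gu, reverse=True)
            merged = frozenset().union(*cycle)
            self.edges = {(merged if a in cycle else a,
                           merged if b in cycle else b)
                          for (a, b) in self.edges} - {(merged, merged)}
            for obj in merged:
                self.group[obj] = merged
        else:
            self.edges.add((gu, gv))

    def order(self):
        """Topological sort of the collapsed sets (Kahn's algorithm)."""
        groups = set(self.group.values())
        indeg = {g: 0 for g in groups}
        for (_, b) in self.edges:
            indeg[b] += 1
        queue = deque(g for g in groups if indeg[g] == 0)
        out = []
        while queue:
            g = queue.popleft()
            out.append(g)
            for (a, b) in self.edges:
                if a == g:
                    indeg[b] -= 1
                    if indeg[b] == 0:
                        queue.append(b)
        return out

def est_rank(order_sets, mu_hat, w):
    """r_hat_x(w): cumulative estimated mass of sets up to the one holding w."""
    total = 0.0
    for A in order_sets:
        total += sum(mu_hat[u] for u in A)
        if w in A:
            return total
    raise KeyError(w)
```

For example, recording a ≼_x b, b ≼_x c and c ≼_x b collapses b and c into one set (they are at equal distance from x), and order() then places a's set before theirs, exactly as the cycle-collapsing description above prescribes.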
The distribution ℓ̂_x is then computed as follows:

ℓ̂_x(w) = (μ̂(w) / r̂_x(w)) · (1 − ε)/Ẑ_x + ε/(n − 1), w ∈ N \ {x}, (19)

where Ẑ_x = Σ_{w ∈ N\{x}} μ̂(w)/r̂_x(w) is a normalization factor and ε > 0 is a small constant. An alternative view of (19) is that the proposed object is selected uniformly at random with probability ε, and proportionally to μ̂(w)/r̂_x(w) with probability 1 − ε. The use of ε > 0 guarantees that every search eventually finds the target t.

Upon locating a target t, any access to the oracle in the history H_k can be used to update O_t; in particular, a call Oracle(u, v, t) that returns u implies the constraint u ≼_t v, which should be added to the data structure through O_t.add(u, v). Note that this operation can take place only at the end of the greedy content search; the outcomes of calls to the oracle can be observed, but the target t is revealed only after it has been located.

Our main result is that, as τ tends to infinity, the above algorithm achieves performance guarantees arbitrarily close to the ones of Theorem 3. Let F̂(τ) be the selection policy defined by (18) at timeslot τ, and denote by

C̄(τ) = Σ_{(s,t) ∈ N×N} λ(s, t) E[C_{F̂(τ)}(s, t)]

the expected search cost at timeslot τ. Then the following theorem holds:

Theorem 5: Assume that for any two targets u, v ∈ N, λ(u, v) > 0. Then

lim sup_{τ→∞} C̄(τ) ≤ 6 c³(μ) H(μ) H_max(μ) / (1 − ε),

where c(μ), H(μ) and H_max(μ) are the doubling parameter, the entropy and the max-entropy, respectively, of the target distribution μ.

The proof of this theorem can be found in Section VII-E.

VII. ANALYSIS

This section includes the proofs of our theorems.

A. Proof of Theorem 1

We first prove that the randomized version of SWND is no harder than its deterministic version.
Define DETSWND to be the same as SWND with the additional restriction that S is deterministic. For any random variable S ⊂ N × N that satisfies |S| ≤ β, there exists a deterministic set S* s.t. |S*| ≤ β and C̄_{S*} ≤ C̄_S; in particular, this is true for

S* = argmin_{S ⊆ N×N : |S| ≤ β} Σ_{(s,t) ∈ N×N} λ(s, t) C_S(s, t).

Thus, SWND is equivalent to DETSWND. In particular, any solution of DETSWND will also be a solution of SWND. Moreover, given a solution S of SWND, any deterministic S belonging to the support of S will be a solution of DETSWND. We therefore turn our attention to DETSWND.

Without loss of generality, we can assume that the weights λ(s, t) are arbitrary non-negative numbers, as dividing every weight by Σ_{s,t} λ(s, t) does not change the optimal solution. The decision problem corresponding to DETSWND is as follows.

DETSWND-D: Given an embedding of N into (M, d), a set of local edges L, a non-negative weight function λ, and two constants α > 0 and β > 0, is there a directed edge set S such that |S| ≤ β and Σ_{(s,t) ∈ N×N} λ(s, t) C_S(s, t) ≤ α?

Note that, given the set of shortcut edges S, forwarding a message with greedy forwarding from any s to t can take place in polynomial time. As a result, DETSWND-D is in NP. We will prove it is also NP-hard by reducing the following NP-complete problem to it:

DOMINATING SET: Given a graph G(V, E) and a constant k, is there a set A ⊆ V such that |A| ≤ k and Γ(A) ∪ A = V, where Γ(A) is the neighborhood of A in G?

Given an instance (G(V, E), k) of DOMINATING SET, we construct an instance of DETSWND-D as follows. The set N in this instance will be embedded in a 2-dimensional grid, and the distance metric d will be the Manhattan distance on the grid.
In particular, let n = |V| be the size of the graph G and, w.l.o.g., assume that V = {1, 2, …, n}. Let

ℓ_0 = 6n + 3, (20)
ℓ_1 = n·ℓ_0 + 2 = 6n² + 3n + 2, (21)
ℓ_2 = ℓ_1 + 3n + 1 = 6n² + 6n + 3, (22)
ℓ_3 = ℓ_0 = 6n + 3. (23)

We construct an n_1 × n_2 grid, where n_1 = (n − 1)·ℓ_0 + 1 and n_2 = ℓ_1 + ℓ_2 + ℓ_3 + 1. That is, the total number of nodes in the grid is N = [(n − 1)·ℓ_0 + 1]·(ℓ_1 + ℓ_2 + ℓ_3 + 1) = Θ(n⁴). The object set N will be the set of nodes in the above grid, and the metric space will be (Z², d), where d is the Manhattan distance on Z². The set of local edges L is defined according to (8) with r = 1, i.e., any two adjacent nodes in the grid are connected by an edge in L.

Denote by a_i, i = 1, …, n, the node on the first column of the grid that resides at row (i − 1)·ℓ_0 + 1. Similarly, denote by b_i, c_i and d_i the nodes on columns (ℓ_1 + 1), (ℓ_1 + ℓ_2 + 1) and (ℓ_1 + ℓ_2 + ℓ_3 + 1) of the grid, respectively, that reside at the same row as a_i, i = 1, …, n. These nodes are depicted in Figure 3.

We define the weight function λ(i, j) over the pairs of nodes in the grid as follows. The pairs of grid nodes that receive a non-zero weight are the ones belonging to one of the following sets:

A_1 = {(a_i, b_i) | i ∈ V},
A_2 = {(b_i, b_j) | (i, j) ∈ E} ∪ {(c_i, d_j) | (i, j) ∈ E} ∪ {(c_i, d_i) | i ∈ V},
A_3 = {(a_i, d_i) | i ∈ V}.

The sets A_1 and A_2 are depicted in Fig. 3 with dashed and solid lines, respectively. Note that |A_1| = n, as it contains one pair for each vertex in V; |A_2| = 4|E| + n, as it contains four pairs

Fig. 3. A reduction of an instance of DOMINATING SET to an instance of DETSWND-D.
Only the nodes on the grid that have non-zero incoming or outgoing demands (weights) are depicted. The dashed arrows depict A_1, the set of pairs that receive a weight W_1. The solid arrows depict A_2, the set of pairs that receive weight W_2.

for each edge in E and one pair for each vertex in V; and, finally, |A_3| = n. The pairs in A_1 receive a weight equal to W_1 = 1, the pairs in A_2 receive a weight equal to W_2 = 3n + 1, and the pairs in A_3 receive a weight equal to W_3 = 1. For the bounds α and β take

α = 2W_1|A_1| + W_2|A_2| + 3W_3|A_3| = (3n + 1)(4|E| + n) + 5n (24)

and

β = |A_2| + n + k = 4|E| + 2n + k. (25)

The above construction can take place in polynomial time in n. Moreover, if the graph G has a dominating set of size no more than k, one can construct a deterministic set of shortcut edges S that satisfies the constraints of DETSWND-D.

Lemma 6: If the instance of DOMINATING SET is a "yes" instance, then the constructed instance of DETSWND-D is also a "yes" instance.

Proof: To see this, suppose that there exists a dominating set A of the graph with size |A| ≤ k. Then, for every i ∈ V \ A, there exists a j ∈ A such that i ∈ Γ(j), i.e., i is a neighbor of j. We construct S as follows. For every i ∈ A, add the edges (a_i, b_i) and (b_i, c_i) to S. For every i ∈ V \ A, add an edge (a_i, b_j) to S, where j is such that j ∈ A and i ∈ Γ(j). For every pair in A_2, add this edge to S. The size of S is

Fig. 4. A "yes" instance of DOMINATING SET and the corresponding "yes" instance of DETSWND-D. The graph on the left can be dominated by two nodes, 1 and 4.
The corresponding set S of shortcut contacts that satisfies the constraints of DETSWND-D is depicted on the right.

|S| = 2|A| + (|V| − |A|) + |A_2| = |A| + n + 4|E| + n ≤ 4|E| + 2n + k.

Moreover, the weighted forwarding distance is

C̄^w_S = Σ_{(i,j) ∈ A_1} W_1 C_S(i, j) + Σ_{(i,j) ∈ A_2} W_2 C_S(i, j) + Σ_{(i,j) ∈ A_3} W_3 C_S(i, j).

We have Σ_{(i,j) ∈ A_2} W_2 C_S(i, j) = W_2 |A_2|, as every pair in A_2 is connected by an edge in S.

Consider now a pair (a_i, b_i) ∈ A_1, i ∈ V. There is exactly one edge in S departing from a_i, which has the form (a_i, b_j), where either j = i or j is a neighbor of i. The distance of the closest local neighbor of a_i from b_i is ℓ_1 − 1. The distance of b_j from b_i is at most n·ℓ_0. As ℓ_1 − 1 = n·ℓ_0 + 2 − 1 > n·ℓ_0, greedy forwarding will follow (a_i, b_j). If b_j = b_i, then C_S(a_i, b_i) = 1. If b_j ≠ b_i, as j is a neighbor of i, S contains the edge (b_j, b_i); hence, if b_j ≠ b_i, C_S(a_i, b_i) = 2. As i was arbitrary, we get that

Σ_{(i,j) ∈ A_1} W_1 C_S(i, j) ≤ 2 W_1 n.

Next, consider a pair (a_i, d_i) ∈ A_3. For the same reasons as for the pair (a_i, b_i), the shortcut edge (a_i, b_j) in S will be used by the greedy forwarding algorithm. In particular, the distance of the closest local neighbor of a_i from d_i is ℓ_1 + ℓ_2 + ℓ_3 − 1, and d(b_j, d_i) is at most ℓ_2 + ℓ_3 + n·ℓ_0. As ℓ_1 − 1 > n·ℓ_0, greedy forwarding will follow (a_i, b_j). By the construction of S, b_j is such that j ∈ A. As a result, again by the construction of S, (b_j, c_j) ∈ S. The closest local neighbor of b_j has Manhattan distance ℓ_2 + ℓ_3 + d(b_j, b_i) − 1 from d_i. Any shortcut neighbor b_k of b_j has Manhattan distance at least ℓ_2 + ℓ_3 from d_i. On the other hand, c_j has Manhattan distance ℓ_3 + d(b_j, b_i) from d_i.
As ℓ_2 > 1 and ℓ_2 > n·ℓ_0 ≥ d(b_j, b_i), the greedy forwarding algorithm will follow (b_j, c_j). Finally, as A_2 ⊂ S, and j = i or j is a neighbor of i, the edge (c_j, d_i) will be in S. Hence, the greedy forwarding algorithm will reach d_i in exactly 3 steps. As i ∈ V was arbitrary, we get that

Σ_{(i,j) ∈ A_3} W_3 C_S(i, j) = 3 W_3 n.

Hence, C̄^w_S ≤ 2W_1 n + W_2|A_2| + 3W_3 n = α and, therefore, the instance of DETSWND-D is a "yes" instance.

To complete the proof, we show that a dominating set of size k exists only if there exists an S that satisfies the constraints in the constructed instance of DETSWND-D.

Lemma 7: If the constructed instance of DETSWND-D is a "yes" instance, then the instance of DOMINATING SET is also a "yes" instance.

Proof: Assume that there exists a set S, with |S| ≤ β, such that the augmented graph has a weighted forwarding distance less than or equal to α. Then

A_2 ⊆ S. (26)

To see this, suppose that A_2 ⊄ S. Then, there is at least one pair of nodes (i, j) in A_2 with C_S(i, j) ≥ 2. Therefore,

C̄^w_S ≥ 1·W_1|A_1| + [(|A_2| − 1)·1 + 2]·W_2 + 1·W_3|A_3| = (3n + 1)(4|E| + n) + 5n + 1 > α,

a contradiction. Essentially, by choosing W_2 to be large, we enforce that all "demands" in A_2 are satisfied by a direct edge in S. The next lemma shows a similar result for A_1. Using shortcut edges to satisfy these "demands" is enforced by making the distance ℓ_1 very large.

Lemma 8: For every i ∈ V, there exists at least one shortcut edge in S whose origin is in the same row as a_i and in a column to the left of b_i. Moreover, this edge is used during the greedy forwarding of a message from a_i to b_i.

Proof: Suppose not. Then, there exists an i ∈ V such that no shortcut edge has its origin between a_i and b_i, or such an edge exists but is not used by the greedy forwarding from a_i to b_i (e.g.
, because it points too far from b_i). Then, the greedy forwarding from a_i to b_i will use only local edges and, hence, C_S(a_i, b_i) = ℓ_1. We thus have, by (21), that

C̄^w_S ≥ ℓ_1 + 2n − 1 + W_2|A_2| = 6n² + 5n + 1 + W_2|A_2|.

On the other hand, by (24), α = 5n + W_2|A_2|, so C̄^w_S > α, a contradiction.

Let S_1 be the set of all edges whose origin is between some a_i and b_i, i ∈ V, and that are used during forwarding from this a_i to b_i. Note that Lemma 8 implies that |S_1| ≥ n. The target of any edge in S_1 must lie to the left of the (2ℓ_1 + 1)-th column of the grid. This is because the Manhattan distance of a_i to b_i is ℓ_1, so its left local neighbor lies at ℓ_1 − 1 steps from b_i. Greedy forwarding is monotone, so the Manhattan distance from b_i of any target of an edge followed subsequently to route towards b_i must be less than ℓ_1. Essentially, all edges in S_1 must point close enough to b_i; otherwise they would not be used in greedy forwarding. This implies that, to forward the "demands" in A_3, an additional set of shortcut edges needs to be used.

Lemma 9: For every i ∈ V, there exists at least one shortcut edge in S that is used when forwarding a message from a_i to d_i and that is neither in S_1 nor in A_2.

Proof: Suppose not. We established above that the target of any edge in S_1 is to the left of the (2ℓ_1 + 1)-th column. Recall that A_2 = {(b_i, b_j) | (i, j) ∈ E} ∪ {(c_i, d_j) | (i, j) ∈ E} ∪ {(c_i, d_i) | i ∈ V}. By the definition of b_i, i ∈ V, the targets of the edges in {(b_i, b_j) | (i, j) ∈ E} lie on the (ℓ_1 + 1)-th column. Similarly, the origins of the edges in {(c_i, d_j) | (i, j) ∈ E} ∪ {(c_i, d_i) | i ∈ V} lie on the (ℓ_1 + ℓ_2 + 1)-th column. As a result, if the lemma does not hold, there is a demand in A_3, say (a_i, d_i), that does not use any additional shortcut edges.
This means that the distance between the (2ℓ_1 + 1)-th and the (ℓ_1 + ℓ_2 + 1)-th column is traversed by using local edges. Hence, C_S(a_i, d_i) ≥ ℓ_2 − ℓ_1 + 1, as at least one additional step is needed to get to the (2ℓ_1 + 1)-th column from a_i. This implies that

C̄^w_S ≥ 2n + W_2|A_2| + ℓ_2 − ℓ_1,

which, by (22), equals W_2|A_2| + 5n + 1 > α, a contradiction.

Let S_3 = S \ (S_1 ∪ A_2). Lemma 9 implies that S_3 is non-empty, while (26) and Lemma 8, along with the fact that |S| ≤ β = |A_2| + n + k, imply that |S_3| ≤ k. The following lemma states that some of these edges must have targets that are close enough to the destinations d_i.

Lemma 10: For each i ∈ V, there exists an edge in S_3 whose target is within Manhattan distance 3n + 1 of either d_i or c_j, where (c_j, d_i) ∈ A_2. Moreover, this edge is used for forwarding a message from a_i to d_i with greedy forwarding.

Proof: Suppose not. Then there exists an i for which greedy forwarding from a_i to d_i does not employ any edge fitting the description in the lemma. Then, the destination d_i cannot be reached by a shortcut edge in either S_3 or A_1 whose target is closer than 3n + 1 steps. Thus, d_i is reached in one of the two following ways: either 3n + 1 steps are required in reaching it, through forwarding over local edges, or an edge (c_j, d_i) in A_2 is used to reach it. In the latter case, reaching c_j also requires at least 3n + 1 steps of local forwarding, as no edge in A_2 or S_3 has a target within 3n steps of it, and any edge in S_1 that may be this close is not used (by the hypothesis). As a result, C_S(a_i, d_i) ≥ 3n + 2, as at least one additional step is required in reaching the ball of radius 3n centered around d_i or c_j from a_i. This gives

C̄^w_S ≥ 5n + W_2|A_2| + 1 > α,

a contradiction.
When forwarding from a_i to d_i, i ∈ V, there may be more than one edge in S_3 fitting the description in Lemma 10. For each i ∈ V, consider the last of all these edges. Denote the resulting subset by S'_3. By definition, |S'_3| ≤ |S_3| ≤ k. For each i, there exists exactly one edge in S'_3 that is used to forward a message from a_i to d_i. Moreover, recall that ℓ_0 = ℓ_3 = 6n + 3. Therefore, the Manhattan distance between any two nodes in {c_1, …, c_n} ∪ {d_1, …, d_n} is at least 2(3n + 1) + 1. As a result, the target of each edge in S'_3 will be within distance 3n + 1 of exactly one of the nodes in the above set.

Let A ⊂ V be the set of all vertices i ∈ V such that the unique edge in S'_3 used in forwarding from a_i to d_i has a target within distance 3n + 1 of either c_i or d_i. Then A is a dominating set of G, and |A| ≤ k. To see this, note first that |A| ≤ k because each target of an edge in S'_3 can be within distance 3n + 1 of only one of the nodes in {c_1, …, c_n} ∪ {d_1, …, d_n}, and there are at most k edges in S'_3. To see that A dominates the graph G, suppose that j ∈ V \ A. Then, by Lemma 10, the edge in S'_3 corresponding to j points within distance 3n + 1 of either d_j or a c_i such that (c_i, d_j) ∈ A_2. By the construction of A, it cannot point in the proximity of d_j, because then j ∈ A, a contradiction. Similarly, it cannot point in the proximity of c_j, because then, again, j ∈ A, a contradiction. Therefore, it points in the proximity of some c_i, where i ≠ j and (c_i, d_j) ∈ A_2. By the construction of A, i ∈ A. Moreover, by the definition of A_2, (c_i, d_j) ∈ A_2 if and only if (i, j) ∈ E. Therefore, j ∈ Γ(A). As j was arbitrary, A is a dominating set of G.

B.
Proof of Theorem 2

According to (12), the probability that object $x$ links to $y$ is given by $\ell_x(y) = \frac{1}{Z_x}\frac{\mu(y)}{r_x(y)}$, where
$$Z_x = \sum_{y\in\mathcal{T}}\frac{\mu(y)}{r_x(y)}$$
is a normalization factor, bounded as follows.

Lemma 11: For any $x\in\mathcal{N}$, let $x^*\in\min_{\preceq_x}\mathcal{T}$ be any object in $\mathcal{T}$ among the closest targets to $x$. Then $Z_x \le 1+\ln(1/\mu(x^*)) \le 3H_{\max}$.

Proof: Sort the target set $\mathcal{T}$ from the closest to the furthest object from $x$ and index objects in an increasing sequence $i=1,\dots,k$, so that objects at the same distance from $x$ receive the same index. Let $A_i$, $i=1,\dots,k$, be the set containing the objects indexed by $i$, and let $\mu_i=\mu(A_i)$ and $\mu_0=\mu(x)$. Furthermore, let $Q_i=\sum_{j=0}^{i}\mu_j$. Then $Z_x = \sum_{i=1}^{k}\mu_i/Q_i$. Define $f_x:\mathbb{R}_+\to\mathbb{R}$ as $f_x(r) = \frac{1}{r}-\mu(x)$. Clearly, $f_x(1/Q_i) = \sum_{j=1}^{i}\mu_j$ for $i\in\{1,2,\dots,k\}$. This means that we can rewrite $Z_x$ as
$$Z_x = \sum_{i=1}^{k}\big(f_x(1/Q_i)-f_x(1/Q_{i-1})\big)/Q_i.$$
By reordering the terms involved in the sum above, we get
$$Z_x = f_x(1/Q_k)/Q_k + \sum_{i=1}^{k-1} f_x(1/Q_i)\left(\frac{1}{Q_i}-\frac{1}{Q_{i+1}}\right).$$
First note that $Q_k=1$, and second, since $f_x(r)$ is a decreasing function,
$$Z_x \le 1-\mu_0 + \int_{1/Q_k}^{1/Q_1} f_x(r)\,dr = 1-\frac{\mu_0}{Q_1}+\ln\frac{1}{Q_1}.$$
This shows that if $\mu_0=0$ then $Z_x \le 1+\ln\frac{1}{\mu_1}$, and otherwise $Z_x \le 1+\ln\frac{1}{\mu_0}$.

Given the set $S$, recall that $C_S(s,t)$ is the number of steps required by greedy forwarding to reach $t\in\mathcal{N}$ from $s\in\mathcal{N}$. We say that a message at object $v$ is in phase $j$ if $2^j\mu(t)\le r_t(v)\le 2^{j+1}\mu(t)$. Notice that the number of different phases is at most $\log_2 1/\mu(t)$. We can write $C_S(s,t)$ as
$$C_S(s,t) = X_1+X_2+\dots+X_{\log\frac{1}{\mu(t)}}, \quad (27)$$
where $X_j$ is the number of hops occurring in phase $j$. Assume that $j>1$, and let $I = \{w\in\mathcal{N} : r_t(w)\le r_t(v)/2\}$.
The probability that $v$ links to an object in the set $I$, and hence moves to phase $j-1$, is
$$\sum_{w\in I}\ell_{v,w} = \frac{1}{Z_v}\sum_{w\in I}\frac{\mu(w)}{r_v(w)}.$$
Let $\mu_t(r) = \mu(B_t(r))$ and let $\rho>0$ be the smallest radius such that $\mu_t(\rho)\ge r_t(v)/2$. Since we assumed that $j>1$, such a $\rho>0$ exists. Clearly, for any $r<\rho$ we have $\mu_t(r)<r_t(v)/2$; in particular,
$$\mu_t(\rho/2) < \tfrac{1}{2}r_t(v). \quad (28)$$
On the other hand, since the doubling parameter is $c(\mu)$, we have
$$\mu_t(\rho/2) > \frac{1}{c(\mu)}\mu_t(\rho) \ge \frac{1}{2c(\mu)}r_t(v). \quad (29)$$
Therefore, by combining (28) and (29), we obtain
$$\frac{1}{2c(\mu)}r_t(v) < \mu_t(\rho/2) < \tfrac{1}{2}r_t(v). \quad (30)$$
Let $I_\rho = B_t(\rho/2)$ be the set of objects within radius $\rho/2$ from $t$. Then $I_\rho\subseteq I$, so
$$\sum_{w\in I}\ell_{v,w} \ge \frac{1}{Z_v}\sum_{w\in I_\rho}\frac{\mu(w)}{r_v(w)}.$$
By the triangle inequality, for any $w\in I_\rho$ and any $y$ such that $d(y,v)\le d(v,w)$, we have
$$d(t,y) \overset{(a)}{\le} d(v,y)+d(v,t) \le d(v,w)+d(v,t) \overset{(b)}{\le} d(t,w)+d(v,t)+d(v,t) \overset{(c)}{\le} \tfrac{1}{2}d(v,t)+d(v,t)+d(v,t) \le \tfrac{5}{2}d(v,t),$$
where in (a) and (b) we used the triangle inequality, and in (c) we used the fact that $\rho/2 < d(v,t)/2$. This means that $r_v(w)\le\mu_t(\tfrac{5}{2}d(v,t))$ and, consequently, $r_v(w)\le c^2(\mu)r_t(v)$. Therefore,
$$\sum_{w\in I}\ell_{v,w} \ge \frac{1}{Z_v}\sum_{w\in I_\rho}\frac{\mu(w)}{c^2(\mu)r_t(v)} = \frac{1}{Z_v}\frac{\mu_t(\rho/2)}{c^2(\mu)r_t(v)}.$$
By (30), the probability of terminating phase $j$ is uniformly bounded:
$$\sum_{w\in I}\ell_{v,w} \ge \min_v \frac{1}{2c^3(\mu)Z_v} \overset{\text{Lem. 11}}{\ge} \frac{1}{6c^3(\mu)H_{\max}(\mu)}. \quad (31)$$
As a result, $X_j$, the number of hops spent in phase $j$, is stochastically dominated by a geometric random variable with the parameter given in (31).
This is because (a) if the current object does not have a shortcut edge that lies in the set $I$, then by Property 1 greedy forwarding sends the message to one of the neighbours that is closer to $t$, and (b) shortcut edges are sampled independently across neighbours. Hence, given that $t$ is the target object and $s$ is the source object,
$$\mathbb{E}[X_j\mid s,t] \le 6c^3(\mu)H_{\max}(\mu). \quad (32)$$
Suppose now that $j=1$. By the triangle inequality, $B_v(d(v,t))\subseteq B_t(2d(v,t))$ and $r_v(t)\le c(\mu)r_t(v)$. Hence,
$$\ell_{v,t} \ge \frac{1}{Z_v}\frac{\mu(t)}{c(\mu)r_t(v)} \ge \frac{1}{2c(\mu)Z_v} \ge \frac{1}{6c(\mu)H_{\max}(\mu)},$$
since object $v$ is in the first phase and thus $\mu(t)\le r_t(v)\le 2\mu(t)$. Consequently,
$$\mathbb{E}[X_1\mid s,t] \le 6c(\mu)H_{\max}(\mu). \quad (33)$$
Combining (27), (32), (33) and using the linearity of expectation, we get
$$\mathbb{E}[C_S(s,t)] \le 6c^3(\mu)H_{\max}(\mu)\log\frac{1}{\mu(t)},$$
and thus $\bar{C}_S \le 6c^3(\mu)H_{\max}(\mu)H(\mu)$.

C. Proof of Theorem 3

The idea of the proof is very similar to the previous one and follows the same path. Recall that the selection policy is memoryless and determined by $\Pr(F(H_k,x_k)=w) = \ell_{x_k}(w)$. We assume that the desired object is $t$ and the content search starts from $s$. Since there are no local edges, the only way the greedy search moves from the current object $x_k$ is by proposing an object that is closer to $t$. As in the SWND case, we are particularly interested in bounding the probability that the rank of the proposed object is roughly half the rank of the current object; this lets us compute how fast the search makes progress. As the search moves from $s$ to $t$, we say that the search is in phase $j$ when the rank of the current object $x_k$ is between $2^j\mu(t)$ and $2^{j+1}\mu(t)$. As stated earlier, the greedy search algorithm keeps making comparisons until it finds another object closer to $t$.
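To make the phase argument concrete, here is a minimal Python simulation of memoryless greedy content search over a toy set of targets on the integer line. The function `greedy_search` and its inputs are our own illustrative constructions (uniform target distribution, absolute-value metric), not the paper's implementation.

```python
import random

def greedy_search(mu, d, s, t, rng, max_steps=10_000):
    # Memoryless greedy content search: repeatedly draw a proposal w with probability
    # proportional to mu(w) / r_cur(w), and move to w when the simulated comparison
    # oracle reports that w is closer to the target t than the current object.
    targets = list(mu)
    cur, cost = s, 0
    while cur != t and cost < max_steps:
        # rank r_cur(w): total target mass within distance d(cur, w) of cur
        r = {w: sum(m for z, m in mu.items() if d(cur, z) <= d(cur, w)) for w in targets}
        ws = [w for w in targets if w != cur]
        w = rng.choices(ws, weights=[mu[x] / r[x] for x in ws])[0]
        cost += 1                      # one comparison query
        if d(w, t) < d(cur, t):        # oracle: proposal closer than current object
            cur = w
    return cost

if __name__ == "__main__":
    n = 16
    mu = {i: 1.0 / n for i in range(n)}      # uniform target distribution
    d = lambda a, b: abs(a - b)
    print("comparisons used:", greedy_search(mu, d, 0, n - 1, random.Random(1)))
```

The rank-biased proposals are exactly what makes each phase terminate after a bounded expected number of comparisons, as quantified by (31).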
We can write $C_F(s,t)$ as
$$C_F(s,t) = X_1+X_2+\dots+X_{\log\frac{1}{\mu(t)}},$$
where $X_j$ denotes the number of comparisons made by the comparison oracle in phase $j$. Let us consider a particular phase $j$ and denote by $I$ the set of objects whose ranks from $t$ are at most $r_t(x_k)/2$. Note that phase $j$ terminates if the selection policy proposes an object from the set $I$. The probability that this happens is
$$\sum_{w\in I}\Pr(F(H_k,x_k)=w) = \sum_{w\in I}\ell_{x_k,w}.$$
Note that the sum on the right-hand side depends on the distribution of shortcut edges and is independent of local edges. To bound this sum we can use (31). Hence, with probability at least $1/(6c^3(\mu)H_{\max}(\mu))$, phase $j$ terminates. In other words, under the above selection policy, if the current object $x_k$ is in phase $j$, then with probability at least $1/(6c^3(\mu)H_{\max}(\mu))$ the proposed object will be in phase $j-1$. The number of comparisons spent in a phase is thus dominated by a geometric random variable, so the average number of queries needed to halve the rank is at most $6c^3(\mu)H_{\max}$, i.e., $\mathbb{E}[X_j\mid s,t]\le 6c^3(\mu)H_{\max}$. Taking the average over the demand $\lambda$, we conclude that the average number of comparisons satisfies
$$\bar{C}_F \le 6c^3(\mu)H_{\max}(\mu)H(\mu).$$

D. Proof of Theorem 4

Our proof amounts to constructing a metric space and a target distribution $\mu$ for which the bound holds. The construction is as follows. For some integers $D,K$, the target set $\mathcal{N}$ is taken as $\mathcal{N} = \{1,\dots,D\}^K$. The distance $d(x,y)$ between two distinct elements $x,y$ of $\mathcal{N}$ is defined as $d(x,y) = 2^m$, where $m = \max\{i\in\{1,\dots,K\} : x(K-i)\ne y(K-i)\}$. We then have the following lemma.

Lemma 12: Let $\mu$ be the uniform distribution over $\mathcal{N}$. Then (i) $c(\mu) = D$, and (ii) if the target distribution is $\mu$, the optimal average search cost $C^*$ based on a comparison oracle satisfies $C^* \ge K\frac{D-1}{2}$.
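The construction above is easy to instantiate. The following Python sketch implements the distance $d(x,y)=2^m$ (with 0-indexed tuples standing in for the 1-indexed coordinates of the text) and numerically checks the doubling inequality $\mu(B(x,2r))\le D\,\mu(B(x,r))$ for the uniform measure; the function and variable names are our own.

```python
import itertools

def dist(x, y):
    # d(x, y) = 2^m with m = max{ i : x(K - i) != y(K - i) }; 0 if x == y.
    if x == y:
        return 0
    K = len(x)
    m = max(i for i in range(1, K + 1) if x[K - i] != y[K - i])
    return 2 ** m

if __name__ == "__main__":
    D, K = 3, 3
    points = list(itertools.product(range(1, D + 1), repeat=K))
    x = points[0]
    for r in (0.5, 1, 2, 4, 8):
        # uniform measure: ball mass is proportional to the number of points it contains
        b, b2 = (sum(1 for y in points if dist(x, y) <= s) for s in (r, 2 * r))
        assert b2 <= D * b            # doubling with constant D, as in Lemma 12(i)
        print(f"r={r}: |B(x,r)|={b}, |B(x,2r)|={b2}")
```

Ball sizes jump by a factor of exactly $D$ at radii $1, 2, 4, \dots$, which is why the doubling constant equals $D$ rather than merely being bounded by it.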
Before proving Lemma 12, we note that Theorem 4 immediately follows from it as a corollary.

Proof: Part (i): Let $x = (x(1),\dots,x(K))\in\mathcal{N}$, and fix $r>0$. Assume first that $r<2$; then the ball $B(x,r)$ contains only $x$, while the ball $B(x,2r)$ contains either only $x$, if $r<1$, or precisely those $y\in\mathcal{N}$ such that $(y(1),\dots,y(K-1)) = (x(1),\dots,x(K-1))$, if $r\ge 1$. In the latter case, $B(x,2r)$ contains precisely $D$ elements. Hence, for such $r<2$ and for the uniform measure on $\mathcal{N}$, the inequality
$$\mu(B(x,2r)) \le D\,\mu(B(x,r)) \quad (34)$$
holds, with equality if in addition $r\ge 1$.

Consider now the case where $r\ge 2$. Let the integer $m\ge 1$ be such that $r\in[2^m,2^{m+1})$. By the definition of the metric $d$ on $\mathcal{N}$, the ball $B(x,r)$ consists of all $y\in\mathcal{N}$ such that $(y(1),\dots,y(K-m)) = (x(1),\dots,x(K-m))$, and hence contains $D^{\min(K,m)}$ points. Similarly, the ball $B(x,2r)$ contains $D^{\min(K,m+1)}$ points. Hence (34) also holds when $r\ge 2$.

Part (ii): We assume that the comparison oracle, in addition to returning the one of the two proposals that is closer to the target, also reveals the distance of the returned proposal to the target. We further assume that upon selection of the initial search candidate $x_0$, its distance to the unknown target is also revealed. We now establish that the lower bound on $C^*$ holds when this additional information is available; it then holds a fortiori for our more restricted comparison oracle.

We decompose the search procedure into phases, depending on the current distance to the destination. Let $L_0$ be the integer such that the initial proposal $x_0$ is at distance $2^{L_0}$ from the target $t$, i.e., $(x_0(1),\dots,x_0(K-L_0)) = (t(1),\dots,t(K-L_0))$ and $x_0(K-L_0+1)\ne t(K-L_0+1)$. No information on $t$ can be obtained by submitting proposals $x$ such that $d(x,x_0)\ne 2^{L_0}$.
Thus, to be useful, the next proposal $x$ must share its first $K-L_0$ components with $x_0$ and differ from $x_0$ in its $(K-L_0+1)$-th entry. Now, keeping track of the previous proposals for which the distance to $t$ remained equal to $2^{L_0}$, the best choice for the next proposal consists in picking it again at distance $2^{L_0}$ from $x_0$, but choosing for its $(K-L_0+1)$-th entry one that has not been proposed so far. It is easy to see that, with this strategy, the number of additional proposals after $x_0$ needed to leave this phase is uniformly distributed on $\{1,\dots,D-1\}$, the number of options for the $(K-L_0+1)$-th entry of the target. A similar argument entails that the number of proposals made in each phase equals $1$ plus a uniform random variable on $\{1,\dots,D-1\}$.

It remains to control the number of phases. We argue that it admits a Binomial distribution with parameters $(K,(D-1)/D)$. Indeed, as we make a proposal that takes us into a new phase, no information is available on the next entries of the target, and for each such entry the new proposal makes a correct guess with probability $1/D$. This yields the announced Binomial distribution for the number of phases (when it equals $0$, the initial proposal $x_0$ coincided with the target). Thus the optimal number of search steps $C$ verifies
$$C \ge \sum_{i=1}^{X}(1+Y_i),$$
where the $Y_i$ are i.i.d., uniformly distributed on $\{1,\dots,D-1\}$, and independent of the random variable $X$, which admits a Binomial distribution with parameters $(K,(D-1)/D)$. Thus, using Wald's identity, we obtain $\mathbb{E}[C] \ge \mathbb{E}[X]\,\mathbb{E}[Y_1]$, which readily implies (ii).

Note that the lower bound in (ii) has been established for search strategies that utilize the entire search history; hence it is not restricted to memoryless search.

E. Proof of Theorem 5

Let $\Delta\mu = \sup_{x\in\mathcal{N}}|\hat\mu(x)-\mu(x)|$.
Observe first that, by the weak law of large numbers, for any $\delta>0$,
$$\lim_{\tau\to\infty}\Pr(\Delta\mu>\delta) = 0, \quad (35)$$
i.e., $\hat\mu$ converges to $\mu$ in probability. The lemma below states that, for every $t\in\mathcal{N}$, the order data structure $O_t$ will learn the correct order of any two objects $u,v$ in finite time.

Lemma 13: Consider $u,v,t\in\mathcal{N}$ such that $u\preceq_t v$. Then the order data structure in $t$ invokes $O_t.\mathrm{add}(u,v)$ after a finite time, with probability one.

Proof: Recall that $O_t.\mathrm{add}(u,v)$ is invoked if and only if a call $\mathrm{Oracle}(u,v,t)$ takes place and returns $u$. If $u\prec_t v$ then $\mathrm{Oracle}(u,v,t) = u$. If, on the other hand, $u\sim_t v$, then $\mathrm{Oracle}(u,v,t)$ returns $u$ with non-zero probability. It thus suffices to show that, for large enough $\tau$, a call $\mathrm{Oracle}(u,v,t)$ occurs at timeslot $\tau$ with non-zero probability. By the hypothesis of Theorem 5, $\lambda(u,t)>0$. By (19), given that the source is $u$, the probability that $\hat F(u)=v$ conditioned on $\hat\mu$ is
$$\hat\ell_u(v) \ge \frac{\mu(v)-\Delta\mu}{1+(n-1)\Delta\mu}\cdot\frac{1-\epsilon}{n-1} + \frac{\epsilon}{n-1} \ge \frac{\mu(v)-\Delta\mu}{(1+(n-1)\Delta\mu)(n-1)},$$
as $\hat Z_v\le n-1$ and $|\hat\mu(x)-\mu(x)|\le\Delta\mu$ for every $x\in\mathcal{N}$. Thus, for any $\delta>0$, the probability that $\mathrm{Oracle}(u,v,t)$ is called at a given timeslot is lower-bounded by
$$\lambda(u,t)\Pr(\hat F(u)=v) \ge \lambda(u,t)\,\frac{\mu(v)-\delta}{(1+(n-1)\delta)(n-1)}\,\Pr(\Delta\mu<\delta).$$
By taking $\delta>0$ smaller than $\mu(v)$, we have by (35) that there exists a $\tau^*$ such that for all $\tau>\tau^*$ the probability that $\mathrm{Oracle}(u,v,t)$ takes place at timeslot $\tau$ is bounded away from zero, and the lemma follows.

Thus, if $t$ is a target then, after a finite time, for any two $u,v\in\mathcal{N}$ the ordered partition $A_1,\dots,A_j$ returned by $O_t.\mathrm{order}()$ will respect the relationship between $u$ and $v$: in particular, for $u\in A_i$, $v\in A_{i'}$, if $u\sim_t v$ then $i=i'$, while if $u\prec_t v$ then $i<i'$. As a result, the estimated rank of an object $u\in A_i$ w.r.t.
$t$ will satisfy
$$\hat r_t(u) = \sum_{x\in\mathcal{T}:\,x\preceq_t u}\hat\mu(x) + \sum_{x\in\mathcal{N}\setminus\mathcal{T}:\,x\in A_{i'},\,i'\le i}\hat\mu(x) = r_t(u) + O(\Delta\mu),$$
i.e., the estimated rank will be close to the true rank, provided that $\Delta\mu$ is small. Moreover, as in Lemma 11, it can be shown that
$$\hat Z_v \le 1+\log\frac{1}{\hat\mu(v)} = 1+\log\frac{1}{\mu(v)+O(\Delta\mu)}$$
for $v\in\mathcal{N}$. From these, for $\Delta\mu$ small enough, we have, for $u,v\in\mathcal{N}$,
$$\hat\ell_u(v) = [\ell_u(v)+O(\Delta\mu)](1-\epsilon) + \frac{\epsilon}{n-1}.$$
Following the same steps as in the proof of Theorem 2, we can show that, given that $\Delta\mu\le\delta$, the expected search cost is upper-bounded by $\frac{6c^3HH_{\max}}{1-\epsilon}+O(\delta)$. This gives
$$\bar C(\tau) \le \left[\frac{6c^3HH_{\max}}{1-\epsilon}+O(\delta)\right]\Pr(\Delta\mu\le\delta) + \frac{n-1}{\epsilon}\Pr(\Delta\mu>\delta),$$
where the second term follows from the fact that, by using the uniform distribution with probability $\epsilon$, we ensure that the cost is stochastically upper-bounded by a geometric r.v. with parameter $\frac{\epsilon}{n-1}$. Thus, by (35),
$$\limsup_{\tau\to\infty}\bar C(\tau) \le \frac{6c^3HH_{\max}}{1-\epsilon}+O(\delta).$$
As this is true for all small enough $\delta$, the theorem follows.

VIII. EXTENSIONS

In this section we discuss two possible extensions of the problem of content search through comparisons. The first empowers the comparison oracle: we assume access to a stronger oracle that returns the most similar object to the target among a set of objects; choosing the size of this set to be two recovers our earlier framework. The second concerns content search when we lift the assumption that objects are embedded in a metric space.

A. Content Search Beyond the Comparison Oracle

A proximity oracle is an oracle that, given a set $A$ of size at most $\kappa$ and a target $t$, returns the closest object to $t$. More formally,
$$\mathrm{Oracle}(A,t) = x \quad\text{if } x\preceq_t y \text{ for all } y\in A.$$
Note that the comparison oracle is a special case of the proximity oracle with $|A|=2$. Moreover, by accessing the comparison oracle $\kappa$ times, one can implement the proximity oracle.

Theorem 14: Given a demand $\lambda$, consider the memoryless and independent selection policy
$$\Pr(F(H_k,x_k) = (w_1,w_2,\dots,w_\kappa)) = \prod_{i=1}^{\kappa}\ell_{x_k}(w_i),$$
where $\ell_{x_k}(w_i)$ is given by (12). Then the cost of greedy content search is bounded as follows:
$$\bar C_F \le \frac{6c^3(\mu)}{\kappa}\cdot H(\mu)\cdot H_{\max}(\mu).$$

Proof: We assume that the target is object $t$ and the content search starts from $s$. The only way the greedy search moves from the current object $x_k$ is by proposing a set $A$ that contains an object closer to $t$. As in Section 3, we are particularly interested in bounding the probability that the rank of the proposed object is roughly half the rank of the current object; this lets us compute how fast the search makes progress. As the search moves from $s$ to $t$, we say that the search is in phase $j$ when the rank of the current object $x_k$ is between $2^j\mu(t)$ and $2^{j+1}\mu(t)$. We can write $C_F(s,t)$ as
$$C_F(s,t) = X_1+X_2+\dots+X_{\log\frac{1}{\mu(t)}},$$
where $X_j$ denotes the number of comparisons made by the comparison oracle in phase $j$. Let us consider a particular phase $j$ and denote by $I$ the set of objects whose ranks from $t$ are at most $r_t(x_k)/2$. Moreover, let the set proposed by the selection policy be $F(H_k,x_k) = (w_1,w_2,\dots,w_\kappa)$. Note that phase $j$ terminates if one of the objects $(w_1,w_2,\dots,w_\kappa)$ belongs to the set $I$. We denote by $F_i$, $1\le i\le\kappa$, the event that $w_i\in I$. Since the $F_i$'s are independent, the probability that phase $j$ terminates satisfies
$$\Pr(F_1\cup F_2\cup\dots\cup F_\kappa) \ge \kappa\left(\sum_{w\in I}\ell_{x_k}(w)\right).$$
To bound the last expression we can use (31).
Hence, with probability at least $\kappa/(6c^3(\mu)H_{\max}(\mu))$, phase $j$ terminates. The number of comparisons spent in a phase is therefore dominated by a geometric random variable, so the average number of queries needed to halve the rank is at most $6c^3(\mu)H_{\max}/\kappa$, i.e., $\mathbb{E}[X_j\mid s,t]\le 6c^3(\mu)H_{\max}/\kappa$. Taking the average over the demand $\lambda$, we conclude that the average number of comparisons satisfies
$$\bar C_F \le 6c^3(\mu)H_{\max}(\mu)H(\mu)/\kappa.$$

B. Content Search Beyond Metric Spaces

Similarity between objects is a well-defined relationship even if the objects are not embedded in a metric space. More specifically, the notation $x\preceq_z y$ simply states that $x$ is more similar to $z$ than $y$ is. If the only information given about the underlying space is the similarity between objects, then the most we can hope for is, for each object $x\in\mathcal{N}$, to sort the other objects $\mathcal{N}\setminus\{x\}$ according to their similarity to $x$. Given the demand $\lambda$, the target set $\mathcal{T}$ is completely specified. For any $y\in\mathcal{T}$, let us define the rank as follows:
$$r_x(y) = |\{z : z\in\mathcal{T},\, z\preceq_x y\}|.$$
We say that $y\in\mathcal{T}$ is the $k$-th closest object to $x$ if $r_x(y)=k$. First note that the rank is in general asymmetric, i.e., $r_x(y)\ne r_y(x)$. Second, the triangle inequality is not satisfied in general, i.e., it may be that $r_x(y) > r_y(z)+r_z(x)$. However, the approximate inequality introduced in [4] is always satisfied. More precisely, we say that the disorder factor $D(\mu)$ is the smallest $D$ such that the approximate triangle inequality
$$r_x(y) \le D\,(r_z(y)+r_z(x))$$
holds for all $x,y,z\in\mathcal{T}$. The factor $D(\mu)$ essentially quantifies the non-homogeneity of the underlying space when the only information given is the order of objects. Let the selection policy for the non-metric space be defined as follows:
$$\Pr(F(H_k,x_k)=w) \propto \frac{1}{r_{x_k}(w)}, \quad (37)$$
for $w\in\mathcal{T}$. In case $w\notin\mathcal{T}$, we define $\Pr(F(H_k,x_k)=w)$ to be zero.
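Selection policy (37) only needs each object's similarity order, so it is straightforward to sketch. In the following Python snippet, the list `order` is a hypothetical precomputed similarity order of the target set around the current object (most similar first); names are our own, not the paper's.

```python
import random

def rank_from(order, w):
    # r_x(w): position of w in x's similarity order over the target set (1 = most similar);
    # `order` is x's target list sorted from most to least similar to x.
    return order.index(w) + 1

def propose(order, rng):
    # Selection policy (37): propose w in T with probability proportional to 1 / r_x(w).
    weights = [1.0 / rank_from(order, w) for w in order]
    return rng.choices(order, weights=weights)[0]

if __name__ == "__main__":
    order = ["a", "b", "c", "d"]      # hypothetical order: "a" most similar to x_k
    rng = random.Random(0)
    draws = [propose(order, rng) for _ in range(10_000)]
    freq = {w: draws.count(w) / len(draws) for w in order}
    print(freq)  # frequencies decay roughly like 1, 1/2, 1/3, 1/4 after normalization
```

The harmonic-sum normalization implicit in `rng.choices` is exactly the constant $Z_{x_k}$ analyzed in the proof of Theorem 15.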
It is of considerable interest to see whether we can still navigate through the database when the characterization of the underlying space is unknown and only the similarity relationship between objects is provided. This is the main theme of the next theorem.

Theorem 15: Consider the above selection policy. Then, for any demand $\lambda$, the cost of greedy content search is bounded as
$$\bar C_F \le 7D(\mu)\log^2|\mathcal{T}|.$$
The proof of this theorem is given below. Note again that the selection policy is memoryless. Furthermore, it is universal in the sense that using it for any demand guarantees a search cost that depends only on the cardinality of the target set and its disorder factor. For instance, this selection policy is useful when only the target set is known a priori and the demand is not fully specified.

Proof: The selection policy in the non-metric scenario is given by (37), which implies that only objects in the target set $\mathcal{T}$ will be proposed by the algorithm. Therefore, except for the starting point $x_0 = s$, the algorithm navigates only through the target set. The probability of proposing $w\in\mathcal{T}$ when $x_k$ is the current object of the search is given by
$$\Pr(F(H_k,x_k)=w) = \frac{1}{Z_{x_k}}\frac{1}{r_{x_k}(w)},$$
where $Z_{x_k} = \sum_{w\in\mathcal{T}}r_{x_k}^{-1}(w)$. Consequently,
$$Z_{x_k} = \begin{cases} H_{|\mathcal{T}|-1}, & \text{if } x_k\in\mathcal{T},\\ H_{|\mathcal{T}|}, & \text{if } x_k\notin\mathcal{T},\end{cases}$$
where $H_n$ is the $n$-th harmonic number. Hence, $Z_{x_k}\le 2\log|\mathcal{T}|$. As the search moves from $s$ to $t$, we say that the search is in phase $j$ when the rank of the current object $v\ne s$ with respect to $t$ satisfies $2^j\le r_t(v)\le 2^{j+1}$. Clearly, there are only $\log|\mathcal{T}|$ different phases. The greedy search algorithm keeps proposing to the oracle until it finds another object closer to $t$.
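The harmonic-number bound on the normalization constant used above is easy to check numerically. A small Python sketch (our own, taking the logarithm base 2 as in the phase argument):

```python
import math

def harmonic(n):
    # H_n = 1 + 1/2 + ... + 1/n: the normalization constant Z of policy (37)
    return sum(1.0 / i for i in range(1, n + 1))

if __name__ == "__main__":
    # check Z <= 2 log2 |T| over a range of target-set sizes
    for size in range(2, 2000):
        assert harmonic(size) <= 2 * math.log2(size)
    print("H_n <= 2 log2(n) holds for n = 2..1999")
```

Since $H_n \approx \ln n + 0.577$ and $\ln n \approx 0.693\log_2 n$, the factor of 2 leaves ample slack for every $n \ge 2$.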
We can write $C_F(s,t)$ as
$$C_F(s,t) = X_1+X_2+\dots+X_{\log|\mathcal{T}|} + X_s,$$
where $X_s$ denotes the number of comparisons made by the oracle at the starting point until the search moves to an object $u\in\mathcal{T}$ such that $r_s(u)\le r_s(t)$. As before, $X_j$ ($j>0$) is the number of comparisons made by the oracle until the search moves to the next phase.

We need to differentiate between the starting point of the process and the rest of it, since, unlike the other objects proposed by the algorithm, the starting object $s$ may not be in the target set. Let the rank of $t$ with respect to $s$ be $k$, i.e., $r_s(t)=k$. Then the probability that the greedy search algorithm proposes an object $v\in\mathcal{T}$ such that $r_s(v)\le r_s(t)$ is
$$\sum_{j=1}^{k}\frac{1}{j\,H_{|\mathcal{T}|}} = \frac{H_k}{H_{|\mathcal{T}|}} \ge \frac{1}{2\log|\mathcal{T}|}.$$
As a result, $\mathbb{E}[X_s\mid s,t]\le 2\log|\mathcal{T}|$. This is the average number of comparisons performed by the oracle until the greedy search algorithm escapes from the starting object $s$.

Let the current object $v\ne s$ be in phase $j$. We denote by
$$I = \{u : u\in\mathcal{T},\ r_t(u)\le r_t(v)/2\}$$
the set of objects whose rank from $t$ is at most $r_t(v)/2$. Clearly, $|I| = r_t(v)/2$. The probability that the greedy search proposes an object $u\in I$ (and hence moves to the next phase) is at least
$$\sum_{u\in I}\frac{1}{2\log|\mathcal{T}|}\frac{1}{r_v(u)} \overset{(a)}{\ge} \frac{r_t(v)}{4\log|\mathcal{T}|\,D(\mu)(r_t(u)+r_t(v))},$$
where in (a) we used the approximate triangle inequality. Since for $u\in I$ we have $r_t(u)\le r_t(v)/2$, the probability of moving from $v$ to the next phase is at least $\frac{1}{6D\log|\mathcal{T}|}$. Therefore, $\mathbb{E}[X_j\mid s,t]\le 6D\log|\mathcal{T}|$. Using the linearity of expectation,
$$\mathbb{E}[C_F(s,t)] \le 6D\log^2|\mathcal{T}| + 2\log|\mathcal{T}| \le 7D\log^2|\mathcal{T}|.$$
The above conditional expectation does not depend on the demand $\lambda$; hence the expected search cost for any demand is bounded as $\mathbb{E}[C_F]\le 7D\log^2|\mathcal{T}|$.

IX.
CONCLUSIONS

In this work, we initiated a study of CSTC and SWND under heterogeneous demands, tying performance to the topology and the entropy of the target distribution. Our study leaves several open problems, including improving the upper and lower bounds for both CSTC and SWND. Given the relationship between these two problems, and the NP-hardness of SWND, characterizing the complexity of CSTC is also interesting. Also, rather than considering restricted versions of SWND, as we did here, devising approximation algorithms for the original problem is another possible direction. Earlier work on comparison oracles eschewed metric spaces altogether, exploiting what were referred to as disorder inequalities [4], [5], [3]. Applying these under heterogeneity is also a promising research direction. Finally, trade-offs between space complexity and the cost of the learning phase vs. the cost of answering database queries are investigated in the above works, and the same trade-offs could be studied in the context of heterogeneity.

REFERENCES

[1] R. White and R. Roth, Exploratory Search: Beyond the Query-Response Paradigm. Morgan & Claypool, 2009.
[2] D. Tschopp and S. N. Diggavi, "Facebrowsing: Search and navigation through comparisons," in ITA Workshop, 2010.
[3] D. Tschopp and S. N. Diggavi, "Approximate nearest neighbor search through comparisons," 2009.
[4] N. Goyal, Y. Lifshits, and H. Schutze, "Disorder inequality: A combinatorial approach to nearest neighbor search," in WSDM, 2008.
[5] Y. Lifshits and S. Zhang, "Combinatorial algorithms for nearest neighbors, near-duplicates and small-world design," in SODA, 2009.
[6] P. Fraigniaud, E. Lebhar, and Z. Lotker, "A doubling dimension threshold Θ(log log n) for augmented graph navigability," in ESA, 2006.
[7] P. Fraigniaud and G.
Giakkoupis, "On the searchability of small-world networks with arbitrary underlying structure," in STOC, 2010.
[8] K. L. Clarkson, "Nearest-neighbor searching and metric space dimensions," in Nearest-Neighbor Methods for Learning and Vision: Theory and Practice (G. Shakhnarovich, T. Darrell, and P. Indyk, eds.), pp. 15–59, MIT Press, 2006.
[9] P. Indyk and R. Motwani, "Approximate nearest neighbors: Towards removing the curse of dimensionality," in STOC, pp. 604–613, 1998.
[10] D. Karger and M. Ruhl, "Finding nearest neighbors in growth-restricted metrics," in SODA, 2002.
[11] R. Krauthgamer and J. R. Lee, "Navigating nets: Simple algorithms for proximity search," in SODA, 2004.
[12] M. Garey, "Optimal binary identification procedures," SIAM J. Appl. Math., 1972.
[13] S. Branson, C. Wah, F. Schroff, B. Babenko, P. Welinder, P. Perona, and S. Belongie, "Visual recognition with humans in the loop," in ECCV (4), pp. 438–451, 2010.
[14] D. Geman and B. Jedynak, "Shape recognition and twenty questions," in Proc. Reconnaissance des Formes et Intelligence Artificielle (RFIA), 1993.
[15] J. Kleinberg, "The small-world phenomenon: An algorithmic perspective," in STOC, 2000.
[16] T. M. Cover and J. Thomas, Elements of Information Theory. Wiley, 1991.
[17] A. Karbasi, S. Ioannidis, and L. Massoulie, "Content search through comparisons," Tech. Rep. CR-PRL-2010-07-0002, Technicolor, 2010. http://www.thlab.net/~stratis/navigability/techreport.pdf.
[18] T. Cormen, C. Leiserson, R. Rivest, and C. Stein, Introduction to Algorithms. MIT Press and McGraw-Hill, 2nd ed., 2001.
[19] B. Haeupler, T. Kavitha, R. Mathew, S. Sen, and R. Tarjan, "Incremental cycle detection, topological ordering, and strong component maintenance," in SODA, 2008.
[20] D. Pearce and P. Kelly, "Online algorithms for maintaining the topological order of a directed acyclic graph," tech.
rep., Imperial College, 2003.
[21] M. Bender, J. Fineman, and S. Gilbert, "A new approach to incremental topological ordering," in SODA, 2009.

November 15, 2018 DRAFT