Telex: Principled System Support for Write-Sharing in Collaborative Applications
The Telex system is designed for sharing mutable data in a distributed environment, particularly for collaborative applications. Users operate on their local, persistent replica of shared documents; they can work disconnected and suffer no network la…
Authors: Lamia Benmouffok, Jean‑Michel Busca, Joan Manuel Marquès, Marc Shapiro, Pierre Sutra, Georgios Tsoukalas
Rapport de recherche n° 6546, 9 May 2008, 28 pages. ISSN 0249-6399, ISRN INRIA/RR--6546--FR+ENG.
Institut National de Recherche en Informatique et en Automatique (INRIA). Thème COM, Systèmes communicants. Équipe-Projet Regal.
Centre de recherche INRIA Paris – Rocquencourt, Domaine de Voluceau, Rocquencourt, BP 105, 78153 Le Chesnay Cedex. Téléphone : +33 1 39 63 55 11. Télécopie : +33 1 39 63 53 30.

Lamia Benmouffok †‡, Jean-Michel Busca †‡, Joan Manuel Marquès §‡, Marc Shapiro †‡, Pierre Sutra †‡, Georgios Tsoukalas ¶

Abstract: The Telex system is designed for sharing mutable data in a distributed environment, particularly for collaborative applications. Users operate on their local, persistent replica of shared documents; they can work disconnected and suffer no network latency. The Telex approach to detect and correct conflicts is application independent, based on an action-constraint graph (ACG) that summarises the concurrency semantics of applications. The ACG is stored efficiently in a multilog structure that eliminates contention and is optimised for locality. Telex supports multiple applications and multi-document updates. The Telex system clearly separates system logic (which includes replication, views, undo, security, consistency, conflicts, and commitment) from application logic. An example application is a shared calendar for managing multi-user meetings; the system detects meeting conflicts and resolves them consistently.

Key-words: No keywords

Funding note (footnote to the title): This research is supported in part by Respire (ANR, France, respire.lip6.fr), Grid4All (FP6, EU, www.grid4all.eu) and grant JC2007-00213 (Spain).
† INRIA, Paris-Rocquencourt, France
‡ LIP6, Paris, France
§ Universitat Oberta de Catalunya, Barcelona, Spain
¶ National Technical University of Athens, Greece

French title (translated): Telex: a write-sharing system for collaborative applications, based on a formal model

Résumé (translated from French): The Telex system is designed for sharing mutable data in a distributed environment, mainly for collaborative applications. Users operate on a local, persistent copy of the documents they share; they can work in disconnected mode and are not slowed down by network latency. Telex uses an application-independent approach to detect and correct conflicts, based on an action-constraint graph (ACG) that summarises the concurrency semantics of applications. The ACG is stored efficiently in a structure called a multilog, which eliminates contention and is optimised for locality. Different applications run on Telex, which allows several documents to be updated in a coordinated way. Telex cleanly separates system logic (which includes replication, views, undo, security, consistency, conflicts, and commitment) from application logic. An example application is a shared calendar for managing multi-user meetings; the system detects meeting conflicts and resolves them consistently.

Mots-clés (keywords): none

1 Introduction

The Telex system provides novel solutions for write-sharing data in co-operative and disconnected work settings.

Existing approaches have severe limitations. For instance, state machine replication [5] imposes high latency and does not support disconnected operation. The popular last-writer-wins algorithm [11] does not ensure any high-level correctness guarantees.
In contrast, Telex is based on a principled approach that combines flexibility and correctness, and cleanly separates application logic from system logic. Application logic transmits to Telex actions (operations) and constraints (concurrency invariants), and applies execution schedules transmitted by Telex. In return, Telex takes care of: replication, consistency, storage and access control; collecting, transmitting and persisting operations; detecting conflicts and computing high-quality conflict-free schedules; forward execution and rollback; checkpointing; and commitment. Telex supports multi-document updates and cross-application scenarios out of the box.

The formal basis of Telex is the Action-Constraint Graph (ACG) [12]. We designed the multilog data structure to store ACG-based documents in a distributed file system. Multilogs eliminate write contention and promote locality.

We developed a number of demonstration applications above Telex. For instance, a shared calendar application lets people organise their agenda collaboratively, arranging private events and group meetings. Telex detects meeting conflicts and proposes possible solutions.

The contributions of this paper include: a novel approach to shared data replication that is application independent yet application-aware, the ACG; the practical engineering of an ACG system, in particular the document and multilog structures; design examples and lessons learned for ACG-based applications; and some benchmarks and performance measurements.

This paper proceeds as follows. Section 2 is an overview. Section 3 explains the data structures that Telex uses. Section 4 documents the Telex architecture and implementation. In Section 5, we present some example applications. Section 6 evaluates the Telex performance. We reflect on lessons learned in Section 7. Section 8 compares Telex with related work. Finally, Section 9 concludes.
2 Telex overview

We give an overview of the Telex system from three complementary points of view.

Footnote 1 (to the discussion of existing approaches in Section 1): Section 8 analyses the state of the art in detail.

Figure 1: Telex interactions. (a) The application reifies a user operation as actions and constraints. (b) A remote action is received: upcall for conflict constraints. (c) Compute schedule(s), execute, display. (The circled numbers refer to Figure 5.)

2.1 User/application perspective

Telex supports participants, i.e., users working at disjoint sites, which may be widely distributed. An authorised participant may replicate a shared document on his site. A site operates optimistically [11]: it applies local actions (operations), sends them to other sites, and eventually replays the actions it receives. Hence, applications are not slowed down by remote synchronisation, network issues, or by remote failures. A participant may work either connected or disconnected from others. Thus, each participant has his own view of the current state of the shared document. Documents and views persist across log-out/log-in and restarts. However, a view is only tentative and may have to roll back.

Telex, not applications, takes care of hard issues such as conflict detection, reconciliation, and consistency. However, since a conflict is the violation of some application invariant, Telex is parameterised by application-specific concurrency invariants called constraints. A constraint relates two actions, either of the same or distinct documents. Hence, Telex maintains consistency between documents.

Figure 1 illustrates the control structure of Telex with a Shared Calendar (SC) application (footnote 2). In this example, the participant creates an appointment, which conflicts (double booking) with one created remotely. In Figure 1.a, the participant performs the appointment operation. The SC application logs the corresponding actions and constraints to the local Telex dæmon (+action appointment).
In Figure 1.b, when the site receives a remote action (signal), it compares it to the concurrent actions. If Telex suspects a conflict, it calls up to the application (getConstraint), which replies with precise information (+constraint antagonism). Finally, as in Figure 1.c, Telex periodically sends schedules to the application, for execution and/or rollback. The application computes and displays the corresponding views, in this example with a conflict indication (conflict) (footnote 3).

Footnote 2: Elements of the figure not discussed here will be explained in later sections.

Table 1: Constraints
Name          Notation   Semantics
NotAfter      A → B      A is never after B in any schedule
Enables       A ⊳ B      B in a schedule implies A in same schedule
NonCommuting  A / B      Must agree on A → B or B → A (conflict)
Atomic        A ⊳⊲ B     All or nothing
Causal        A ⊳→ B     B depends causally on A
Antagonism    A ←→ B     A and B never both in same schedule (conflict)

2.2 Formal perspective: actions and constraints

Telex is based on a formal model, the Action-Constraint Graph (ACG) [12]. The ACG is a labelled graph whose nodes are the actions and edges are the constraints. The current view of a site is the result of executing a sound schedule, i.e., an ordering of actions currently known at that site, that obeys the safety constraints NotAfter and Enables. In effect, the ACG represents the set of all legal views.

Table 1 presents briefly the constraints supported by Telex; for full details please refer to the relevant publications [12]. The first three are primitive, the last three are combinations of the primitives (footnote 4). These represent important classes of concurrency invariants. While they can approximate the true application semantics only grossly, we have found that they are sufficiently expressive for reconciliation purposes in several kinds of applications [9, 13].
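To make the soundness condition concrete, here is a minimal Python sketch of an ACG fragment and a soundness check. It is an illustration under stated assumptions, not the Telex implementation: the names `Kind`, `Constraint`, `atomic`, `causal`, `antagonism` and `is_sound` are hypothetical, and the decomposition of the three derived constraints into primitives is read off the notation in Table 1 (Atomic as Enables in both directions, Causal as Enables plus NotAfter, Antagonism as NotAfter in both directions). The sketch also assumes that NotAfter only restricts schedules that contain both actions.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    NOT_AFTER = "NotAfter"          # A -> B: A is never after B
    ENABLES = "Enables"             # A |> B: B in a schedule implies A in it
    NON_COMMUTING = "NonCommuting"  # A / B: sites must agree on an order


@dataclass(frozen=True)
class Constraint:
    kind: Kind
    a: str
    b: str


# Derived constraints, expanded into primitives (assumption: this is the
# decomposition suggested by the notation of Table 1).
def atomic(a, b):
    return [Constraint(Kind.ENABLES, a, b), Constraint(Kind.ENABLES, b, a)]


def causal(a, b):
    return [Constraint(Kind.ENABLES, a, b), Constraint(Kind.NOT_AFTER, a, b)]


def antagonism(a, b):
    return [Constraint(Kind.NOT_AFTER, a, b), Constraint(Kind.NOT_AFTER, b, a)]


def is_sound(schedule, constraints):
    """Check that an ordered list of action ids obeys NotAfter and Enables."""
    pos = {action: i for i, action in enumerate(schedule)}
    if len(pos) != len(schedule):
        return False  # an action may appear at most once
    for c in constraints:
        if c.kind is Kind.NOT_AFTER:
            # Violated when both actions are present and A is not before B.
            if c.a in pos and c.b in pos and pos[c.a] >= pos[c.b]:
                return False
        elif c.kind is Kind.ENABLES:
            # Violated when B is present without A.
            if c.b in pos and c.a not in pos:
                return False
        # NonCommuting is not a safety constraint: it is settled at commitment.
    return True
```

For example, with `causal("ins", "mod")` the schedule `["ins", "mod"]` passes the check while `["mod"]` alone does not. Under the same reading, a NotAfter constraint from an action to itself excludes that action from every sound schedule.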
Formally, eventual consistency requires that all schedules be sound, that they have a common stable sound prefix, that every action eventually be either aborted or in the prefix, and that non-commuting actions that are in the prefix be ordered (footnote 5). The latter two items imply a global consensus between sites. We call this consensus the commitment protocol. In Telex, commitment is optimistic, i.e., it occurs in the background, not in the critical path of applications.

Footnote 3: For the purpose of this paper, document state, view and schedule are synonymous. View emphasises that the state is local and is not unique; schedule emphasises that it is computed by some ordering of available actions.

Footnote 4: Atomic does not ensure transactional isolation; an isolation constraint will be added in the future. Currently, to achieve isolation, the user must manually group operations into a single action.

Footnote 5: Mutually-commuting actions may run in any relative order.

2.3 Engineering perspective: multi-logs and commitment

The design of Telex is motivated by some major requirements and challenges: (i) Persist and replicate the ACG. (ii) Provide strong guarantees above a distributed file system with only best-effort consistency. (iii) Integrate documents into the file system, with reasonable overhead and scalability. (iv) Provide access control, without violating consistency. (v) Remove old ACG entries from storage. (vi) Decentralised, peer-to-peer design, with support for casual disconnected operation.

A document is a named entity in the file system. For locality, a document stores only the portion of the ACG consisting of the actions operating on the document, and their constraints. Telex documents coexist with ordinary files and directories in the file system. Using one or the other is up to the application. Telex relies on external mechanisms to store and replicate documents, and to propagate changes to remote sites.
To avoid file system bottlenecks and consistency issues, each participant writes to a distinct append-only log within a document. To enable incremental garbage collection, the log is broken down into successive chunk files. This structure is called a multilog. A log is a succession of actions and constraints in no particular order. We optimise for the expected common case, where constraints are inside the same log; inter-log constraints within the same document are slightly more expensive. Inter-document constraints are assumed to be relatively rare and are more costly.

Because of network delays and disconnections, and because of filtering and access control (explained later), at any point in time, different participants may observe different ACGs. However, each participant's view is consistent, because it results from a sound schedule. Thus, if some action A is not in a view, and A Enables B, then B is also not in that view.

The current view can be recorded in a snapshot. Snapshots name a view, speed up the computation of later views, and help with garbage collection.

A decentralised, background commitment protocol ensures that the common prefix of schedules makes progress. Each participant can vote for a schedule according, for instance, to user preference. Voting is decentralised and peer-to-peer. Committed log records may be deleted. However, it may be advantageous to retain them for auditing, recovery or selective undo (to be explained later).

3 Data structures

3.1 Document storage

Figure 2: Storage of Telex document

Telex stores its documents in file systems with standard, best-effort consistency guarantees. The storage design obeys some specific requirements. Documents should be seamlessly integrated above a standard POSIX interface, with reasonable performance and scalability. They should co-exist with classical files and directories.
Participants must be able to work normally while disconnected. The system should scale well with the number of collaborating participants. Finally, participants' data must be secured even when shared.

We implemented multilogs above the federative peer-to-peer file system VOFS [1]. VOFS provides global access to files with best-effort consistency. It supports disconnected operations via persistent replication, and notifications for file modifications on distributed files. A complete description of VOFS is outside of the scope of this paper; here we focus on specific features related to Telex integration.

3.1.1 Multilog Design

As illustrated in Figure 2, a Telex document is a structured directory of files. Applications and Telex may store document-specific data within the document, such as filters and snapshots. These data are local to a participant; only the multilog needs to be replicated.

A multilog is itself structured as a directory that contains an append-only log per participant. Actions and constraints created by an application are appended to that participant's log. Each participant's log is replicated at the other participants' sites; VOFS propagates the updates to the network. As each log has a single writer, is append-only, and is local to a document, this avoids write contention and scalability issues.

Propagation of a log through the network is asynchronous, i.e., a log replica may contain only a prefix of its source, as indicated by the sync bar in the figure. Telex instances monitor the logs for new updates. Eventually, all actions and constraints are known to all participants.

Figure 3: Implementation of multilogs over VOFS.

As time passes, an action eventually becomes committed and is not needed any more. To enable removing such old records, a log is itself structured as a directory of chunk files. When the size of the current chunk reaches a threshold, a new one is created.
The name of a chunk file includes a sequence number, making it convenient to read chunks in order, and to selectively delete chunks. A chunk may be deleted when all the actions it contains are committed and there is a later materialised snapshot. This is, however, a policy decision; a site may decide instead to retain old chunks for auditing or recovery.

3.1.2 Multilogs on VOFS

A document is stored by the Telex dæmon in the file system as a directory. The internal structure of this directory is not meaningful to users, and is intended to be hidden by the user interface (much like the bundles of MacOS).

In our deployed multilogs so far, we have used a centralised setup at a primary master site, containing the authoritative version of all the logs in a document. Participants' sites cache the logs persistently, making them available for disconnected operation. The master site is a single point of failure and a scalability bottleneck.

In the future, we plan to use a peer-to-peer configuration, using across-network symbolic links that VOFS provides. Here, each participant hosts the authoritative version of his own log on his own site, as in Figure 3. As before, participants cache remote logs persistently. The master site serves only to list all the logs using symbolic links. Any other method of distributing the list could be used.

Figure 4: Two multilogs with their logs; note constraints within log, within document, and between documents

3.1.3 The Multilog Toolkit

VOFS is optimised for multilogs, which improves the user experience. However, multilogs can be implemented above any ordinary distributed file system. We provide a toolkit implementation of multilogs, as a set of simple programs and dæmons, providing simple and efficient multilog management and access above an ordinary file system.
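As an illustration of the layout described in Section 3.1.1, the following Python sketch keeps one append-only log per participant, split into chunk files whose names carry a sequence number. It is a hypothetical sketch on an ordinary local file system, not the Telex toolkit: the directory name `multilog`, the `chunk-NNNNNNNN` naming scheme, the one-record-per-line format and the functions `append_record` and `read_log` are all assumptions made for the example.

```python
import os


def append_record(doc_dir, participant, record, chunk_limit=64 * 1024):
    """Append one record (bytes without newlines) to the participant's log.

    Only the participant himself is expected to call this for his own log,
    so each log has a single writer.
    """
    log_dir = os.path.join(doc_dir, "multilog", participant)
    os.makedirs(log_dir, exist_ok=True)
    chunks = sorted(os.listdir(log_dir))  # zero-padded names sort in order
    current = chunks[-1] if chunks else "chunk-00000000"
    path = os.path.join(log_dir, current)
    # Start a new chunk once the current one has reached the threshold.
    if chunks and os.path.getsize(path) >= chunk_limit:
        seq = int(current.split("-")[1]) + 1
        path = os.path.join(log_dir, f"chunk-{seq:08d}")
    with open(path, "ab") as f:
        f.write(record + b"\n")
    return path


def read_log(doc_dir, participant):
    """Read all records of one participant's log, chunk by chunk, in order."""
    log_dir = os.path.join(doc_dir, "multilog", participant)
    records = []
    for name in sorted(os.listdir(log_dir)):
        with open(os.path.join(log_dir, name), "rb") as f:
            records.extend(f.read().splitlines())
    return records
```

Because old chunks are separate files, deleting a fully committed chunk is a plain file removal that does not touch the chunk currently being written.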
The toolkit implementation follows closely the design of Figure 2. More details are available in Section 6.

3.2 Action and Constraint

An action represents an application operation. It is described by several attributes, of which some are known to Telex and others are application-specific. Among the former, the most important is a list of action keys. An action key indicates the document subset that this action targets; if two actions have a common key, this indicates suspicion that the actions conflict (see Section 4.2.1 for more detail). An action belongs to only one document. It is uniquely identified by the triple ⟨document, issuer, timestamp⟩. Telex logs an action in the log of the participant who issues it.

A constraint reifies a semantic relation between two actions. It is defined by its type (NonCommuting, NotAfter or Enables) and by the two actions it binds. A constraint is uniquely identified by the triple ⟨type, action1, action2⟩. Telex logs a constraint in the log of the participant who issues it.

Most often, a constraint binds two actions of the same document, whether issued by the same participant or not. Such a constraint is called an intra-document constraint. However, a constraint may bind actions of two distinct documents. Such a constraint is called a cross-document constraint. It is then logged in both documents.

A constraint C references an action A by using one of the three following forms: (timestamp) if A is issued by the same participant as C and belongs to the same document; (issuer, timestamp) if A belongs to the same document as C but is issued by another participant; and (docId, issuer, timestamp) otherwise. In the last form, docId is the id of the document that action A belongs to.

Figure 4 shows an example of the two types of constraint. Constraint C1 is an intra-document constraint: it binds actions A1 and A2 of document OSDI_paper.
Constraint C1 is issued by Pierre and thus it is logged in Pierre's log of OSDI_paper. On the other hand, constraint C2 is a cross-document constraint: it binds action A3 of document OSDI_paper and action A4 of document figure_1. Constraint C2 is issued by Georgios and thus it is logged in Georgios's log of both OSDI_paper and figure_1.

3.3 Views

A desirable feature of replication in collaborative work is to enable different participants to have their own view of a shared document. For instance, a participant working on a given section of a shared document may temporarily ignore updates to the same section by other participants.

Telex allows the participant to select a particular view of a document by means of action filters. A filter defines which actions of the ACG Telex must exclude when computing sound schedules. When applying a filter, Telex also excludes all actions that filtered actions enable. This ensures that the view computed by filtering is always sound, i.e., document invariants are not violated.

A participant defines a filter by specifying its name and one or more filtering criteria involving any attribute of an action. The participant may define several filters on a document and dynamically add and remove them. Telex saves currently-defined filters as part of the persistent state of a document.

Note that a filter may target a specific action of a document. By adding and removing the filter, the user may thus selectively undo and redo the corresponding action in his view of the document. (To undo an action persistently, the participant must abort it. By convention, this is expressed by marking the action as antagonistic with itself.)

Filters also provide a means to permanently exclude the operations of a participant who turns out to be malicious, as in the Ivy file system [6]. Contrary to Ivy, Telex filters maintain correctness, by excluding all actions that depend on the malicious participant's actions.
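The filtering rule of this section (exclude the filtered actions, then everything they enable, transitively) can be sketched as a small closure computation. The Python below is a sketch under assumptions rather than Telex code: the function `apply_filters`, the representation of actions as a dictionary of attribute dictionaries, and the representation of Enables constraints as `(a, b)` pairs are all hypothetical.

```python
def apply_filters(actions, enables, filters):
    """Return the ids of the actions that remain visible after filtering.

    actions: dict mapping action id -> dict of attributes
    enables: iterable of (a, b) pairs, meaning "a Enables b"
    filters: list of predicates over an attribute dict; an action matching
             any predicate is excluded from the view
    """
    excluded = {aid for aid, attrs in actions.items()
                if any(f(attrs) for f in filters)}
    # Propagate exclusion along Enables edges until nothing changes, so that
    # no visible action is left depending on an excluded one.
    changed = True
    while changed:
        changed = False
        for a, b in enables:
            if a in excluded and b in actions and b not in excluded:
                excluded.add(b)
                changed = True
    return set(actions) - excluded
```

For instance, filtering on a hypothetical `issuer` attribute removes a participant's actions together with every action that transitively depends on them through Enables, which is what keeps the filtered view sound.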
3.4 Snapshot

A snapshot records some view of the document. To define a snapshot, a participant specifies its name and the schedule of actions whose execution yields the state being recorded. In addition, the application may provide the corresponding binary state of the document. In this case, the snapshot is said to be materialised. Materialised snapshots speed up the computation of a view and are used as garbage collection points.

The participant may define any number of snapshots of interest to him, and later remove those that are no longer useful. Telex saves the set of currently-defined snapshots as part of the persistent state of the document.

4 Telex architecture and operation

Figure 5: Telex architecture

Figure 5 is a detailed view of Figure 1 which shows the overall architecture of Telex. An instance of Telex runs at each site and communicates with remote sites. On top of the figure are the applications using the services of Telex. Several such applications may run concurrently at the same site. In the middle of the figure is the Telex system. It is composed of two main modules, the scheduler and the replica reconciler, layered on top of two auxiliary modules, the transmitter and the logger. Arrows in the figure represent invocation paths between Telex modules and to/from applications.

Each application may open one or more documents. For each open document, Telex creates one instance of each module, which maintains the execution context of the document. The only exception is when documents are bound by cross-document constraints, as described in Section 4.2.3. In this case, the bound documents share the same instance of the replica reconciler and the scheduler.

We describe next the interaction between a Telex instance and the outside world and then detail the operation of the main modules.
4.1 Interactions

Telex-application interactions involve exchanging pieces of AC graphs (sets of actions and constraints downwards, sets of schedules upwards). The interaction cycle is as follows. The participant acts upon the application, which translates his request into one or more actions and constraints and passes them to Telex. In return, Telex computes a sound schedule from the set of locally-known actions and constraints and hands the schedule to the application. The application executes the schedule and presents the resulting state to the participant. If some actions conflict, then several sound schedules exist, each corresponding to a possible solution to the conflict. The application presents the resulting states to the participant so that he can select the solution he prefers.

Telex sites exchange actions and constraints through multilogs, and communicate with each other in the commitment protocol. The logger module logs the actions and constraints submitted by the local participant in the participant's log. In return, VOFS notifies the logger when remote participants' logs are updated. The transmitter determines the set of peer sites and provides an Atomic Multicast service among peer sites (arrows #9 and #10).

4.2 Scheduler

The role of the scheduler is twofold. First, it maintains the in-memory ACG that represents the state of the document at the local site. Second, it periodically computes sets of sound schedules from the ACG and proposes them to the application for execution.

Actions and/or constraints are added to the graph either by: (i) the application (Figure 5, arrow #1), when the local participant updates the document; (ii) the logger (arrow #2), when it receives an update issued by a remote participant; or (iii) the replica reconciler (arrow #3), when it commits a schedule. The scheduler passes locally-submitted actions and constraints to the logger (arrow #4) to log them on persistent storage.
4.2.1 Cross-site constraint generation

Actions logged independently by two participants may conflict; for instance, in the shared calendar application, the same user could be added to two parallel meetings. Telex ensures that conflicts are reified by constraints as follows. When a site receives a new action, it compares it against already-known, concurrent actions of the same document. If they have a common key, then Telex invokes the corresponding application's getConstraint upcall. If the actions really conflict, the application responds by logging an appropriate constraint (arrow #5 in Figure 1.b or Figure 5).

Action keys are opaque to Telex, which tests them for equality only. Action keys serve as a compact, but approximate, representation of the document subset that the action uses or updates. Typically, an action key hashes the identifier of a parameter of the action. Multiple keys have "or" semantics (Telex upcalls getConstraint if a key of one action equals any key of the other). To implement "and" semantics (for instance, to get an upcall only if two given objects are involved) the application hashes the XOR of their identifiers into a single key. An action with no keys conflicts with no other. If two unrelated actions happen to have equal action keys, no harm is done, other than a loss of performance.

4.2.2 Schedule generation

A large number of sound schedules exist for any given ACG in the general case. It is therefore not feasible to compute all sound schedules beforehand and present them to the application. Besides, the application may be interested only in a few or even just one schedule. For these reasons, Telex generates sound schedules dynamically, upon application request (this is not shown in Figure 5). The application may thus iterate through the proposed schedules and stop when one or more appropriate schedules are found.
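The action-key matching of Section 4.2.1 can be sketched in a few lines of Python. This is an illustrative sketch, not the Telex API: the functions `key_for`, `and_key` and `may_conflict` are hypothetical, SHA-256 truncated to 64 bits is an arbitrary choice of hash, and the sketch assumes that string identifiers are first mapped to integers so that they can be XORed.

```python
import hashlib


def key_for(identifier: str) -> int:
    """Hash an identifier into an opaque 64-bit action key."""
    digest = hashlib.sha256(identifier.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def and_key(id_a: str, id_b: str) -> int:
    """Single key for 'and' semantics: hash of the XOR of both identifiers.

    XOR is commutative, so the key does not depend on argument order.
    """
    mixed = key_for(id_a) ^ key_for(id_b)
    return key_for(format(mixed, "016x"))


def may_conflict(keys_a, keys_b) -> bool:
    """'Or' semantics: suspect a conflict if the two actions share any key.

    Keys are only compared for equality; an action with no keys never
    matches, so it never triggers the getConstraint upcall.
    """
    return not set(keys_a).isdisjoint(keys_b)
```

In a calendar-like setting, an action keyed with `key_for("alice")` would trigger the upcall against any other action that also carries that key, whereas an action keyed only with `and_key("alice", "room1")` would match only actions keyed on the same pair. One caveat of the XOR trick is that pairing an identifier with itself always yields the same key, whatever the identifier.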
Telex generates the best schedules first, where the quality metric is the number of actions included (implying fewer actions aborted). Optimal scheduling is NP-complete, therefore Telex runs a heuristic inspired by IceCube [9]. Secondary goals of the heuristic are to give preference to actions of the local participant in the case of a conflict, and to avoid returning a schedule equivalent to one returned previously.

4.2.3 Bound documents

Two documents are said to be bound if there exists a constraint between an action of one and an action of the other, and either action (or both) is not committed. For instance, if a participant wishes to update two documents atomically, he sets an Enables constraint in each direction between the updates.

The actions of a document may not be scheduled independently from those of the documents it is bound to. Scheduling is optimised for the common case of non-bound documents, but we provide special processing for this particular case. Note that bound documents may be handled by distinct applications.

Telex processes bound documents by merging them into a single shared ACG in order to compute global schedules over all actions and constraints. Each global schedule generally contains actions from all bound documents. Thus, in order to execute a global schedule, Telex first projects the schedule on each document and passes each resulting sub-schedule to the relevant application. The projection operation simply consists in retaining only those actions that belong to the target document while preserving their order. Telex assigns the same identifier to the sub-schedules deriving from the same global schedule. This way, the participant can identify matching sub-schedules on each bound document.

4.3 Replica reconciler

Each Telex site proposes a set of constraints, a proposal, to remote sites. A proposal contains decisions to commit, abort or serialise actions.
These proposals may differ, due to asynchronous communication, filtering, differing local information, or user preference. The replica reconciler is in charge of commitment, i.e., reaching agreement on a common schedule prefix. Commitment occurs in the background, not within the critical path of applications. The committed proposal appears as a prefix of the local schedules.

We propose a plug-in replica reconciler architecture, providing different strategies according to needs. A reconciler has four (asynchronous) phases.

1. Each site computes a proposal, according to its local view, for instance based on the user's preferences (arrow #8 in Figure 5).
2. The transmitter atomic-multicasts proposals to the set of sites directly concerned by the agreement (arrow #9); in the case of bound documents, more than one replica group may be concerned. Atomic multicast maintains liveness in the presence of faults and network lags.
3. The transmitter forwards proposals it receives up to the replica reconciler (arrow #10).
4. According to the commitment algorithm (described next), the reconciler chooses a winning proposal, and logs it (arrows #3 and #4).

Currently we propose two commitment algorithms. (i) A first-in first-out algorithm for applications such as a distributed database. At each site the FIFO algorithm proposes to minimise the number of dead actions according to its local view. When a site delivers a new proposal, the FIFO algorithm checks the soundness of the proposal according to the previous winning proposals (arrows #8 and #7). If the decision is sound, the reconciler adds it to the ACG; if not, the decision is discarded.

(ii) A voting algorithm that takes into account local preferences. A proposal is a vote spanning one or multiple actions over one or more documents. A proposal is broken into sub-ACGs with specific properties, called candidates.
Candidates containing the same actions challenge each other. A candidate may be elected only if its set of actions is transitively closed in the union of all the ACGs across sites. This protocol is described in detail in a separate publication [15].

4.4 Access control

The Telex design includes access control at increasingly fine-grain levels, using a security framework (whose description is out of scope of this document). This is indicated by the three arrows marked check in Figure 1. (i) Access control at file granularity ensures that a single participant writes a given log, and that only authorised users can read a log. (ii) The Telex dæmon checks whether a user is allowed to access an individual log record (footnote 6). (iii) Applications may enforce further control. For instance, in the SC application, a user might observe the times that another user is busy, but not be allowed to see the other details of his meetings.

As explained in Section 2.3, access control does not violate consistency.

5 Applications

To provide insight on the issues involved in using the Telex system, this section presents some of our example applications. We will return to the lessons learned in a later section.

5.1 Simple Replicated Dictionary

We start with a simple example. Our Simple Replicated Dictionary Application (SRDA) manages shared dictionaries. SRDA is intended as a building block for applications such as a shared address book. Users can operate on a dictionary in either connected or disconnected mode. Telex guarantees that, in spite of node arrivals, departures or failures, all instances of a given dictionary converge.

A document contains tuples of the form ⟨tupleID, attribute1, attribute2, ...⟩, for any number of attributes. Each attribute is a ⟨name, value⟩ pair. SRDA provides these operations:

insert(tupleID, attrs): inserts a new entry, with identifier tupleID and attributes attrs, into the dictionary document.
- modify(tupleID, attrs): modifies attributes for the given tupleID.
- remove(tupleID): deletes the tuple corresponding to the given tupleID.
- read(tupleID): returns the attributes corresponding to the given tupleID.

In the first operation, the tupleID must be previously unused or removed; for all the others, a tuple identified by tupleID must already exist. The modify operation assigns the listed attributes if they already exist for the tuple; otherwise it adds them.

Insert, modify and remove operations each translate to a Telex action. Because Telex does not yet support isolated multi-operation transactions, we manage write dependencies in the write operations, as explained shortly. Read operations are treated as local.

Footnote 6 (access check on individual log records, Section 4.4): This is not yet implemented in the current version.

5.1.1 Sequential constraints

Table 2 summarises the sequential semantics of SRDA. SRDA logs these constraints at the same time as it logs the right-hand action of the constraint.

| Action | Constraints logged with the action |
|---|---|
| insert | ∀ previous rem_i.TID: rem_i.TID → current ins.TID |
| remove | ins.TID ⊳→ current rem.TID |
| modify | ins.TID ⊳→ current mod.TID; ∀ previous mod_i.TID.attr_j: mod_i → current mod |

Table 2: Sequential execution constraints (Notation: ins = insert, mod = modify, rem = remove, attr = attribute, TID = tupleID)

In the Telex design, the application should log causal dependence only when the second action truly depends on the first. Hence, a modify action, or a remove, is causally dependent on the insert that created the tuple. Thus, if the insert aborts or fails, the dependent modify and remove actions will be discarded from any sound schedule. Furthermore, we treat every write operation as a read-compute-write transaction.
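The sequential constraints of Table 2 can be illustrated with a short Python sketch. This is not Telex code: the `SrdaLogger` class, the `Action` record, the constant names and the encoding of a constraint as a `(left, kind, right)` triple are all hypothetical, chosen only to show which constraints would accompany each insert, modify or remove when it is logged, assuming a single local history.

```python
from dataclasses import dataclass, field
from itertools import count

NOT_AFTER = "NotAfter"  # written "a → b" in Table 2
CAUSAL = "Causal"       # written "a ⊳→ b" in Table 2


@dataclass(frozen=True)
class Action:
    ident: int
    kind: str                        # "insert", "modify" or "remove"
    tuple_id: str
    attrs: frozenset = frozenset()   # names of the attributes written


@dataclass
class SrdaLogger:
    """Hypothetical helper: derives the Table 2 constraints from local history."""
    history: list = field(default_factory=list)
    _ids: count = field(default_factory=count)

    def _previous(self, kind, tuple_id):
        return [a for a in self.history
                if a.kind == kind and a.tuple_id == tuple_id]

    def _exists(self, tuple_id):
        # A tuple exists if its most recent insert/remove is an insert.
        for a in reversed(self.history):
            if a.tuple_id == tuple_id and a.kind in ("insert", "remove"):
                return a.kind == "insert"
        return False

    def record(self, kind, tuple_id, attrs=()):
        """Create the action and the constraints to log together with it."""
        if kind == "insert" and self._exists(tuple_id):
            raise KeyError(f"{tuple_id} already exists")
        if kind != "insert" and not self._exists(tuple_id):
            raise KeyError(f"{tuple_id} does not exist")
        current = Action(next(self._ids), kind, tuple_id, frozenset(attrs))
        constraints = []
        if kind == "insert":
            # Every previous remove of this tupleID is NotAfter the insert.
            for rem in self._previous("remove", tuple_id):
                constraints.append((rem.ident, NOT_AFTER, current.ident))
        else:
            # remove/modify depend causally on the insert that created the tuple.
            creator = self._previous("insert", tuple_id)[-1]
            constraints.append((creator.ident, CAUSAL, current.ident))
        if kind == "modify":
            # Earlier modifies of an overlapping attribute are NotAfter this one.
            for mod in self._previous("modify", tuple_id):
                if mod.attrs & current.attrs:
                    constraints.append((mod.ident, NOT_AFTER, current.ident))
        self.history.append(current)
        return current, constraints
```

Returning the constraints together with the action mirrors the rule that SRDA logs each constraint at the same time as its right-hand action; how the pair is actually written to the multilog is outside the scope of this sketch.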
In order to ensure read-your-writes session guarantees [16], we set NotAfter constraints between insert, modify and remove actions in the same user session, even between different dictionary documents. Finally, to ensure the correct scheduling of a remove followed by an insert with the same tuple identifier, we make all previous remove actions with the same tupleID NotAfter the current insert.

The SRDA application logs the above constraints in the multilog, at the same time as it logs the right-hand action.

5.1.2 Concurrency constraints

Since it is illegal to insert the same identifier twice, two concurrent insert actions that refer to the same identifier are NonCommuting. Otherwise, concurrent inserts commute. Similarly, two concurrent modify operations with the same identifier and overlapping attributes are also NonCommuting. These constraints are added by the application when Telex invokes its getConstraint method. They are summarised in Table 3, where NonCommuting is noted "/".

| | ins_2 | mod_2 |
|---|---|---|
| ins_1 | ins_1.TID = ins_2.TID ⇒ ins_1 / ins_2 | |
| mod_1 | impossible | mod_1.TID = mod_2.TID ∧ attrs_1.TIDs ∩ attrs_2.TIDs ≠ ∅ ⇒ mod_1 / mod_2 |

Table 3: SRDA getConstraint

In order to ensure that Telex upcalls the getConstraint method as needed, insert and modify actions have an action key, computed as a hash of the tupleID.

5.2 Shared Calendar

Our Shared Calendar (SC) application is representative of collaborative decision-making applications. SC illustrates the advantages of Telex for semantically-rich collaborative applications. SC helps people organise private events and group meetings collaboratively, possibly in disconnected and asynchronous mode.
Contrary to existing calendar applications, SC detects conflicts (such as double booking), proposes solutions, and ensures agreement and eventual consistency. This would be difficult to achieve without Telex support. Application logic (i.e., maintaining the data structures and identifying constraints) is well separated from the system logic, i.e., persistence, replication, conflict detection and resolution, commitment, etc.

5.2.1 SC logic

Each user or location has an associated calendar document. Each event (e.g., a meeting) is a separate document. A calendar may be read or updated by other users, who can (if so authorised) create or manage events, invite people to an event, or identify conflicts and free time.

We use the following notations. An event e is unique, has a name e.name and a date e.date, and is materialised by a Telex document e.dox.

A user A creates an event e by creating the document e.dox, and by logging an open-event action in his own calendar and an invite(A) action in e.dox. He also logs an enable-event action in e.dox that symbolises the creation of the event. This action is used to specify constraints on the event creation, as shown next.

Later, user A may invite other users by logging an open-event action in his log within their calendars, and a corresponding invite action in e.dox. Once a user has opened an event document, he may invite more users. He can also cancel the event or some user's invitation by logging a cancel-event or a cancel-invitation action in e.dox (footnote 7).

Figure 6: Execution scenario for the Shared Calendar application

The action keys identify the event and its time-slots. Therefore, actions in the same calendar for the same event, or for different events at the same time, will have overlapping keys, causing Telex to invoke the getConstraint upcall interface of SC. A calendar document action commutes with all other calendar document actions.
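The role of action keys described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Telex interface: `KeyIndex`, `sc_keys` and the shape of the upcall (a function of two actions returning a constraint name or `None`) are hypothetical, and the sketch compares a new action with every earlier action sharing a key, whereas a real system would restrict the comparison to concurrent actions.

```python
from collections import defaultdict


class KeyIndex:
    """Hypothetical stand-in for the key matching done before an upcall."""

    def __init__(self, get_constraint):
        self._by_key = defaultdict(list)        # key -> actions carrying it
        self._get_constraint = get_constraint   # application-provided upcall

    def add(self, action, keys):
        """Register an action; return the constraints reported by the upcall."""
        candidates = []
        for key in sorted(keys):
            for other in self._by_key[key]:
                if other not in candidates:
                    candidates.append(other)
        constraints = []
        for other in candidates:
            # Only actions with overlapping keys reach the application.
            kind = self._get_constraint(other, action)
            if kind is not None:
                constraints.append((other, kind, action))
        for key in keys:
            self._by_key[key].append(action)
        return constraints


def sc_keys(event, slots):
    """Assumed key scheme: one key for the event, one per time-slot."""
    return {("event", event)} | {("slot", slot) for slot in slots}
```

With this scheme, two invitations for different events in the same slot share a `("slot", ...)` key and are therefore handed to the application, while actions with disjoint keys are never compared. This keeps the number of upcalls proportional to the number of overlapping actions rather than to all pairs.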
Constraints between event document actions are similar to the SRDA constraints, where enable-event, cancel-event and invite (or cancel-invitation) are like insert, remove and modify respectively. To avoid double bookings, concurrent invite actions are antagonistic if they concern the same user at the same time but different events.

Footnote 7 (cancelling events and invitations, above): Currently it is not possible to collaboratively change the time of an event. This will require extensions to Telex to associate the time updates with some user invitation in order to detect a double booking, which is future work.

5.2.2 Use case

Consider the scenario in Figure 6. Users Jean-Michel, Lamia and Marc are working separately and communicate only via the SC application.

Jean-Michel organises the meeting Networking Seminar (NS) with Marc. He proposes two alternative dates, Monday and Tuesday (Operation 1 in the figure). Lamia also organises a meeting, Greek Lesson (GL), with Marc on Monday (Operation 2).

SC creates the event documents and logs the actions and constraints to Telex, as detailed in Figure 7, which depicts the state of Marc's site at time t3.

Figure 7: Marc's site at t3

Lamia's SC instance creates the GL.dox document, imports Marc's calendar, and logs the following actions:

- On Marc's and Lamia's calendars: open-event(e2).
- On GL.dox: A = enable-event, B = invite(Lamia), C = invite(Marc). SC groups them atomically: A ⊳⊲ B ∧ B ⊳⊲ C.

To express the alternative, Jean-Michel's SC instance transparently creates two events, NS1 and NS2, with conflicting enable-event actions. For both events, SC generates actions similar to those for the GL event.

Suppose that, at some point in time t1, Marc has received Jean-Michel's actions, but not yet Lamia's. This may happen, for instance, if Lamia is working offline.
Telex computes the schedules corresponding to two possible solutions: (i) holding NS on Monday and aborting NS on Tuesday; or (ii) holding NS on Tuesday and aborting NS on Monday. Since the former solution contains more actions, it will be proposed first.

Later, at t2, Marc knows Lamia's actions. Telex checks the keys of Lamia's actions against Jean-Michel's. C = invite(Marc) on GL.dox and E = invite(Marc) on NS1.dox both have a key representing the Monday slot. Therefore, Telex asks SC for the corresponding constraints. SC returns an antagonism constraint C ←→ E. This ensures that no view contains both C and E, and that one or the other (or both) eventually aborts. Finally, Telex offers the two possible solutions: (i) NS on Tuesday and GL on Monday, aborting NS on Monday; or (ii) NS on Monday, aborting GL on Monday and aborting NS on Tuesday.

Since Lamia is not invited to event NS, she may read neither NS1.dox nor NS2.dox. Nevertheless, Telex ensures that she eventually gets notified of a conflict occurrence that may abort GL. The same goes for Jean-Michel. The reconciliation phase ensures that Marc, Lamia and Jean-Michel eventually see a consistent state for the GL and NS events.

5.3 Shared wiki

For lack of space, we describe our Shared Wiki Application (SWA) only briefly.

Each wiki page is a separate document. Every user currently editing it has a log in the document. His site keeps a local replica of the wiki text, which the user modifies locally using a standard text editor. Every time the user saves, the SWA computes the difference from the previous version, and translates it into insert-line and delete-line actions. Modifying a line is interpreted as an atomic grouping of delete-line and insert-line. The SWA uses the WOOTO operational transformation algorithm [7] to ensure that concurrent edit operations commute. A delete-line action depends causally on the action that inserted the line.
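The save-time translation described above (difference from the previous version, turned into insert-line and delete-line actions, with a modified line treated as an atomic delete-plus-insert group) can be sketched with Python's standard `difflib` module. This is an illustrative assumption about how such a step could look, not the SWA implementation: `diff_to_actions` and the tuple encoding of actions are hypothetical, and positions are plain line indices, whereas WOOTO relies on unique line identifiers.

```python
import difflib


def diff_to_actions(old_lines, new_lines):
    """Translate a saved edit into groups of line-level actions.

    Each returned group is a list of actions to be logged atomically.
    Delete positions index the old version, insert positions the new one
    (an assumption of this sketch).
    """
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    groups = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        deletes = [("delete-line", i, old_lines[i]) for i in range(i1, i2)]
        inserts = [("insert-line", j, new_lines[j]) for j in range(j1, j2)]
        if tag == "replace":
            # A modified region: delete and insert form one atomic group.
            groups.append(deletes + inserts)
        else:
            # Pure insertions or deletions: one singleton group per line.
            groups.extend([action] for action in deletes + inserts)
    return groups
```

For example, changing the second line of a three-line page and appending a fourth line yields one atomic group (delete the old second line, insert its replacement) followed by a singleton group for the appended line.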
Inserting a line between two other lines depends causally on the two corresponding line insert operations.

Since all concurrent operations inside a document commute, there will never be any conflicts. Therefore, edit actions carry no keys, and Telex never upcalls getConstraint to the SWA. Schedule computation is trivial, since all schedules that are compatible with causal dependence order are equivalent.

Existing wiki editors maintain the set of past versions of a page. Thanks to Telex, SWA can reconstruct any past version, and additionally maintains the relations between versions. In the future, we could extract more history information from the persistent multilog, including page splits and merges, and copy-paste between pages.

From the perspective of a single page, Telex serves mainly to reliably broadcast actions and replay them in causal order. One added value of Telex for SWA is the ability to perform multi-document updates, e.g., a global replace through all wiki pages consistently. Telex also enables multi-application scenarios, e.g., ensuring that a wiki page contains the details of a meeting agreed in the shared calendar application.

| Config name | 1x8M | 8x8M | 1x8L | 8x8L |
|---|---|---|---|---|
| Writers | 1 | 8 | 1 | 8 |
| Log size (MB) | 50 | 50 | 5 | 5 |
| RX limiting | no | no | yes | yes |
| Run time (sec) | 3.4 | 9.3 | 306.48 | 309.31 |
| Avg RX+TX (B/s) | 102.9M | 75.3M | 228.4K | 226.3K |

Table 4: Representative results for shared multilogs with 1 and 8 writers, with and without limiting receiving traffic

Figure 8: Multilog replication progression for 8 writers, throttled incoming traffic, 4 disconnect. (Plot "Log Propagation Progression: Site-3 view, 8 writers, 4 disconnect, limit incoming": log size in bytes against time in seconds, one curve per site, site-0 to site-7.)
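Returning briefly to the wiki of Section 5.3: replaying actions in causal order amounts to a topological sort of the causal dependence graph. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later). The function names and the action labels in the usage note are hypothetical, and the sketch assumes, as stated for the SWA, that all concurrent operations commute, so any order compatible with causal dependence is acceptable.

```python
from graphlib import TopologicalSorter


def causal_replay_order(depends_on):
    """Return one replay order compatible with causal dependence.

    `depends_on` maps each action to the set of actions it causally
    depends on; a cyclic dependence raises graphlib.CycleError.
    """
    return list(TopologicalSorter(depends_on).static_order())


def respects_causality(order, depends_on):
    """Check that every action appears after all the actions it depends on."""
    position = {action: index for index, action in enumerate(order)}
    return all(position[dep] < position[action]
               for action, deps in depends_on.items()
               for dep in deps)
```

For instance, with a line L2 inserted between L1 and L3 and later deleted, the map `{"ins-L2": {"ins-L1", "ins-L3"}, "del-L2": {"ins-L2"}}` yields an order in which both neighbouring inserts precede `ins-L2`, which in turn precedes `del-L2`.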
6 Performance evaluation

6.1 Multilog experiment

The multilog toolkit is a simple set of tools and dæmons that create, access and connect logs in multilogs. It is written in Python and uses TCP/IP for networking. It straightforwardly implements the design illustrated in Figure 2.

There are four main utilities in the toolkit. LogServer monitors a log and propagates updates. LogClient contacts a list of LogServers and locally replicates their logs. LogTool is a utility that can read or write a log. MultilogD is a simple dæmon that, given a list of participants, combines the log tools to implement a multilog.

6.1.1 Evaluation summary

The multilog structure decouples reads and writes and promotes mostly-linear access patterns. Therefore, the read/write performance of multilogs is dominated by that of the local filesystem and of the network stack. The purpose of this evaluation is to demonstrate this fact; the results are summarised in Table 4.

Our performance goal is to scale to very large numbers of readers. The number of writers for a single document is expected to remain relatively small, on the order of tens of participants. This is typical of the Internet society.

Efficient propagation from a small number of writers to a huge number of readers is possible in peer-to-peer networks, where recipients of data propagate them further. The net effect of such a solution is a high outgoing bandwidth and a limited incoming bandwidth. In some of our experiments, we emulate this effect by severely limiting the incoming traffic of participants while leaving outgoing traffic unlimited.

6.1.2 Detailed Results

The experimental setup involves one participant installed on each of 8 nodes interconnected with Gigabit Ethernet. The scenario is simple: either one or all 8 participants begin to log a specific amount of data as fast as possible.
At the same time, each participant reads the logs and records its replication progression over time. The writers and readers are implemented with LogTool instances; logs are served by LogServers, and propagated updates are received and written to replicas by LogClients.

Table 4 lists representative results for running 1 and 8 concurrent writers, both with and without limiting the incoming traffic. The average traffic is the sum of the incoming and outgoing traffic. Our conclusion is that, when there is no limit in effect, multilog propagation performance is comparable to the maximum network bandwidth. When limits are in place, although overall bandwidth drops as expected, we observe that varying the number of writers between 1 and 8 has no effect. Furthermore, in all the experiments, disconnection of a participant does not disrupt the remaining ones, as illustrated in Figure 8.

6.2 Synthetic benchmarks

Sound schedule computation. Telex computes sound schedules using the IceCube algorithm [9]. For a randomly generated graph containing 10000 actions and 20000 constraints, our algorithm computes a sound schedule in 200 ms. In running mode, Telex uses incremental mode, and the computation takes around a millisecond.

Reconciliation time. We test the time to decide newly proposed actions. During this experiment we compute a schedule every 100 ms, and a proposal every 100 ms. Each site submits 20 actions per second. The average time to commit an action using the FIFO algorithm (see Section 4.3) is 64 ms.

6.3 STMBench

We run the STMBench7 benchmark [2], which emulates an application with a rich data structure and many different operations. We chose STMBench7 mainly because it demonstrates concurrency and conflicts. It also serves as an illustration of the use of Telex on a complex data structure.
STMBench7 was developed to exercise software transactional memories, based on the previous OO7 benchmark for object-oriented databases. STMBench7 builds an object graph with millions of objects connected by numerous pointers. It contains 45 operations (21 read-only, 24 read-write) of various scope and complexity. We ported only the read-write operations to Telex. They all operate in a similar manner: traverse the data structure, reading one or many attributes of one or many objects, and modify an object.

An STMBench7 benchmark consists of two phases: creating a randomised object graph, and invoking operations. We measure only the second phase. There are four main categories of operations:

- Long traversals: access large parts of the object graph, typically all assemblies and atomic parts.
- Short traversals: access fewer objects, traversing the graph along a randomly chosen path.
- Short operations: choose a small number of objects, and perform an operation on these objects or in their neighbourhood.
- Structure modifications: randomly create or delete objects, or create or delete pointers between objects.

Each STMBench7 operation is mapped to a single action, and hence is isolated from concurrent operations. Unexpectedly, in the original code, operations always commute, because the updates either swap two shared pointers, or add 1 modulo 2 to a shared integer. We therefore modified the benchmark so that, with some probability, updates either commute or do not commute.

Due to the large number of operations, we will not present a comprehensive list of constraints. Instead, we explain the rules we follow to define the constraints.

- Any modification to an object is causally dependent on the creation of the same object.
- Two actions that modify the same data are NonCommuting.
- If an action reads some data, and another, concurrent action writes the same data, the former is NotAfter the latter. This ensures that, at all sites, the read will see the value before the write.

The results of the benchmark, executing the operations that modify data (not the structure), are shown in Table 5. Performance is essentially independent of the number of sites (20 s for one site, 21 s for two to six sites).

| Number of sites | Time to benchmark (s) |
|---|---|
| 1 | 20 |
| 2 | 21 |
| 3 | 21 |
| 4 | 21 |
| 5 | 21 |
| 6 | 21 |

Table 5: STMBench7 results

7 Lessons learned

Experience with applications and benchmarks has given us useful feedback, both regarding the implementation of Telex and as guidelines for application developers.

The current implementation of Telex suffers from excessive memory consumption. The ACG can quickly reach sizes of several tens of thousands of nodes, and is accessed concurrently by many threads. For instance, the scheduler parses the ACG at the same time as local and remote applications are modifying it. To avoid concurrency issues, the scheduler takes a full copy of the current ACG, which both consumes memory and is slow (in Java). Similarly, forward execution and rollback of applications involve copying their internal state, which can be very large. In both cases, an obvious solution (and future work) is to use copy-on-write instead.

Translating application semantics into actions and constraints is a skill that takes time to acquire. We present some guidelines derived from our own experience. Note that these are not hard rules, and may even conflict with one another.

The most important suggestion is to leverage commutativity as much as possible. As noted for the SWA, if all operations commute, consistency is trivial. The SWA example also shows that, sometimes, operations that intuitively appear non-commuting can be designed or transformed to commute.

We learned that it is important to turn every piece of shared information into a separate document.
In the initial design of SC, calendars were the only documents, and events were implicit in the calendars. This raised a number of problems, because there was no obvious way to detect when a meeting conflict would impact another user indirectly. Separating out events as distinct documents solved this.

It is important to distinguish the sequential constraints (mainly NotAfter and Causal) from the concurrency constraints (conflicts). The former are logged with their right-hand action; the latter are logged in response to getConstraint.

Concurrency constraints are derived from the application invariants. For instance, in SRDA, the sequential specification forbids two tuples with the same identifier; it follows that concurrent inserts with the same identifier are in Antagonism.

One lesson from STMBench7 is to reason about high-level operations rather than low-level ones, in order to deal with fewer combinations. Furthermore, it is sometimes the case that high-level operations commute (for instance, incrementing and decrementing a shared integer) even though their low-level implementations (e.g., reads and writes) do not.

However, in some cases, it may be simpler to reason about a small number of low-level primitives when they may be combined into a large number of operations. Currently, this kind of approach is complicated by the lack of support for transactional isolation, which is future work.

Constraints are hard to validate. We suggest two complementary approaches for future work. A compiler could generate actions and constraints from a high-level specification, and a checker could verify that all action-constraint combinations satisfy the application invariants.

8 Related work

State-machine replication [5] is based on a total order of operations.
This ensures consistency and correctness, but requires consensus at each operation, in the critical path of the application. In contrast, Telex's optimistic approach performs consensus in batches, in the background.

Optimistic replication [11] has been widely used, e.g., in replicated file systems (for instance, Coda [3] or Roam [10]) and for collaborative work (e.g., Bayou [17]). In these systems, replicas eventually converge, but they generally do not ensure any high-level correctness. For instance, the widely-used last-writer-wins (LWW) policy loses updates when conflicts occur, and does not maintain consistency between objects. Our constraints additionally ensure that application invariants are preserved.

Many replicated systems transmit new values or deltas (the state-based model). The operation-based model used in Telex (i.e., the system stores, transmits and replays logs of operations) retains more useful information for reconciliation. This is especially advantageous when high-level operations logically commute despite reading and writing the same physical data, as in our SC and SWA applications.

The literature on computer-supported co-operative work is widely based on operational transformation (OT) [14]. OT ensures commutativity between concurrent operations by modifying them at replay time. Combined with reliable causal-order broadcast, this ensures convergence with no further concurrency control, but unfortunately OT appears limited to very simple text-editing scenarios. Telex takes advantage of commutativity when it is available, and supports any mix of commutative and non-commutative operations.

Coda's application-specific resolvers [4] or Bayou [17] give applications full control over conflicts. However, this requires developers to have a deep understanding of distributed systems issues.
Instead, Telex requires stylised concurrency constraints from applications and takes care of conflict resolution in an application-independent manner.

Telex has many similarities with Bayou [17] and also many differences. Bayou is an operation-based system that provides commitment; the committed state is guaranteed correct. However, Bayou relies on a primary site for commitment and the committed schedule is unpredictable. Furthermore, the system offers no help for reconciliation.

Constraints were used for reconciliation in the IceCube [9] system. IceCube relies on a primary site for commitment. In Telex, each site runs an IceCube engine (or any alternative) to propose schedules, and the commitment protocol ensures consensus based on these proposals. IceCube supports a richer set of constraints and can extract them from the applications' source code [8].

The Ivy peer-to-peer file system [6] reconciles the current state of a file from single-writer, append-only logs. There are several differences between Ivy and Telex. Ivy is designed for connected operation. Ivy is state-based and reconciles using a per-byte LWW algorithm by default. Whereas Telex localises logs per document, in Ivy there is a single global log for all the updates of a given participant. Reading any file requires scanning all the logs in the system, which does not scale well, although this is offset somewhat by caching. Ivy has no commitment protocol, therefore a state may remain tentative indefinitely. The Ivy authors suggest that malicious updates can be removed after the fact, by ignoring the corresponding log. However, since Ivy does not record constraints, it cannot reconstruct a correct state: for instance, an update by an innocent user that depends on a previous but malicious update cannot be removed.

9 Conclusion

We presented the Telex system for shared mutable documents in a distributed system.
We presented our motivations, its formal principles, the engineering design and implementation, and a number of prototypical applications. We also provided some performance measurements.

Our two main innovations are our principled approach based on the action-constraint graph, and the multilog structure. The former enables Telex to provide correctness guarantees while maintaining application concurrency invariants. It also allows a clear separation between the responsibilities of applications and those of the system. Thanks to constraints, applications specify precisely the level of consistency that they need, and the system enforces that level efficiently, and no more.

Independently of the ACG, we argue that the multilog structure is better adapted to shared, mutable documents than ordinary files, especially in a collaborative environment. A file system may provide guarantees for directories, but generally only best-effort consistency for files. Furthermore, the design goals of a file system are likely to be different from the needs of actual applications.

The multilog structure decouples reads and writes, avoids contention, encourages locality, and allows efficient linear access. Software at a higher level interprets the logs to reconstruct the application state. In our case, this is Telex, but it could be the application directly. Multilogs do not impose any unnecessary limitations.

Telex is open source software, available at gforge.inria.fr/projects/telex2.

Acknowledgments

We thank Abhishek Gupta, of Indian Institute of Technology Guwahati, for implementing multilogs during an internship at INRIA and for authoring the SWA application, and Zenon Perisé, of Universitat Oberta de Catalunya, for authoring the Collaborative Environment application.
References

[1] Antony Chazapis, Georgios Tsoukalas, Georgios Verigakis, Kornilios Kourtis, Aristidis Sotiropoulos, and Nectarios Koziris. Global-scale peer-to-peer file services with DFS. In GRID, pages 251–258, 2007.
[2] Rachid Guerraoui, Michal Kapalka, and Jan Vitek. STMBench7: a benchmark for software transactional memory. In Euro. Conf. on Comp. Sys. (EuroSys), pages 315–324, 2007.
[3] James J. Kistler and M. Satyanarayanan. Disconnected operation in the Coda file system. ACM Trans. on Comp. Sys. (TOCS), 10(5):3–25, February 1992.
[4] Puneet Kumar and M. Satyanarayanan. Flexible and safe resolution of file conflicts. In Usenix Tech. Conf., New Orleans, LA, USA, January 1995.
[5] Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–565, July 1978.
[6] A. Muthitacharoen, R. Morris, T. Gil, and B. Chen. Ivy: A read/write peer-to-peer file system. In Symp. on Op. Sys. Design and Implementation (OSDI), Boston, MA, USA, December 2002. Usenix.
[7] Gérald Oster, Pascal Urso, Pascal Molli, and Abdessamad Imine. Data consistency for P2P collaborative editing. In Int. Conf. on Computer-Supported Cooperative Work (CSCW), pages 259–268, Banff, Alberta, Canada, November 2006. ACM Press.
[8] Nuno Preguiça, Marc Shapiro, and J. Legatheaux Martins. Automating semantics-based reconciliation for mobile transactions. In CFSE'3: conférence française sur les systèmes d'exploitation, pages 515–524, La-Colle-sur-Loup, France, October 2003.
[9] Nuno Preguiça, Marc Shapiro, and Caroline Matheson. Semantics-based reconciliation for collaborative and mobile environments. In Int. Conf. on Coop. Info. Sys. (CoopIS), volume 2888 of Lecture Notes in Comp. Sc., pages 38–55, Catania, Sicily, Italy, November 2003. Springer-Verlag GmbH.
[10] Peter Reiher, John S. Heidemann, David Ratner, Gregory Skinner, and Gerald J. Popek. Resolving file conflicts in the Ficus file system. In Usenix Conf. Usenix, June 1994.
[11] Yasushi Saito and Marc Shapiro. Optimistic replication. Computing Surveys, 37(1):42–81, March 2005.
[12] Marc Shapiro, Karthikeyan Bhargavan, and Nishith Krishna. A constraint-based formalism for consistency in replicated systems. In Int. Conf. on Principles of Dist. Sys. (OPODIS), number 3544 in Lecture Notes in Comp. Sc., pages 331–345, Grenoble, France, December 2004.
[13] Marc Shapiro, Nuno Preguiça, and James O'Brien. Rufus: mobile data sharing using a generic constraint-oriented reconciler. In Conf. on Mobile Data Management, pages 146–151, Berkeley, CA, USA, January 2004.
[14] Chengzheng Sun, Xiaohua Jia, Yanchun Zhang, Yun Yang, and David Chen. Achieving convergence, causality preservation, and intention preservation in real-time cooperative editing systems. Trans. on Comp.-Human Interaction, 5(1):63–108, March 1998.
[15] Pierre Sutra, João Barreto, and Marc Shapiro. Decentralised commitment for optimistic semantic replication. In Int. Conf. on Coop. Info. Sys. (CoopIS), Vilamoura, Algarve, Portugal, November 2007.
[16] Douglas B. Terry, Alan J. Demers, Karin Petersen, Mike J. Spreitzer, Marvin M. Theimer, and Brent B. Welch. Session guarantees for weakly consistent replicated data. In Int. Conf. on Para. and Dist. Info. Sys. (PDIS), pages 140–149, Austin, Texas, USA, September 1994.
[17] Douglas B. Terry, Marvin M. Theimer, Karin Petersen, Alan J. Demers, Mike J. Spreitzer, and Carl H. Hauser. Managing update conflicts in Bayou, a weakly connected replicated storage system. In 15th Symp. on Op. Sys. Principles (SOSP), pages 172–182, Copper Mountain, CO, USA, December 1995. ACM SIGOPS, ACM Press.