The OpenCPU System: Towards a Universal Interface for Scientific Computing through Separation of Concerns

Applications integrating analysis components require a programmable interface which defines statistical operations independently of any programming language. By separating concerns of scientific computing from application and implementation details w…

Authors: Jeroen Ooms

June 19, 2014

Abstract

Applications integrating analysis components require a programmable interface which defines statistical operations independently of any programming language. By separating concerns of scientific computing from application and implementation details we can derive an interoperable API for data analysis. But what exactly are the concerns of scientific computing? To answer this question, the paper starts with an exploration of the purpose, problems, characteristics, struggles, culture, and community of this unique branch of computing. By mapping out the domain logic, we try to unveil the fundamental principles and concepts behind statistical software. Along the way we highlight important problems and bottlenecks that need to be addressed by the system in order to facilitate reliable and scalable analysis units. Finally, the OpenCPU software is introduced as an example implementation that builds on HTTP and R to expose a simple, abstracted interface for scientific computing.

1 Introduction

Methods for scientific computing are traditionally implemented in specialized software packages assisting the statistician in all facets of the data analysis process. A single product typically includes a wealth of functionality to interactively manage, explore and analyze data, and often much more. Products such as R or STATA are optimized for use via a command line interface (CLI) whereas others such as SPSS focus mainly on the graphical user interface (GUI). However, increasingly many users and organizations wish to integrate statistical computing into third party software.
Rather than working in a specialized statistical environment, methods to analyze and visualize data get incorporated into pipelines, web applications and big data infrastructures. This way of doing data analysis requires a different approach to statistical software which emphasizes interoperability and programmable interfaces rather than UI interaction. Throughout the paper we refer to this approach to statistical software as embedded scientific computing.

Early pioneering work in this area was done by Temple Lang (2000) and Chambers et al. (1998) who developed an environment for integration of statistical software in Java based on the CORBA standard (Henning, 2006). Recent work in embedded scientific computing has mostly aimed at low-level tools for directly connecting statistical software to general purpose environments. For R, bindings and bridges are available to execute an R script or process from inside all popular languages. For example, JRI (Urbanek, 2013a), RInside (Eddelbuettel and Francois, 2011), rpy2 (Gautier, 2012) or RinRuby (Dahl and Crawford, 2009) can be used to call R from respectively Java, C++, Python or Ruby. Heiberger and Neuwirth (2009) provide a set of tools to run R code from DCOM clients on Windows, mostly to support calling R in Microsoft Excel. The rApache module (mod_R) makes it possible to execute R scripts from the Apache2 web server (Horner, 2013). Similarly, the littler program provides hash-bang capability for R, as well as simple command-line and piping use on UNIX (Horner and Eddelbuettel, 2011). Finally, Rserve is a TCP/IP server which provides low level access to an R process over a socket (Urbanek, 2013b).

Even though these language-bridging tools have been available for several years, they have not been able to facilitate the big breakthrough of R as a ubiquitous statistical engine.
Given the enormous demand for analysis and visualization these days, the adoption of R for embedded scientific computing is actually quite underwhelming. In my experience, the primary cause for the limited success is that these bridges are hard to implement, do not scale very well, and leave the most challenging problems unresolved. Substantial plumbing and expertise of R internals is required for building actual applications on these tools. Clients are supposed to generate and push R syntax, make sense of R's internal C structures and write their own framework for managing requests, graphics, security, data interchange, etc. Thereby, scientific computing gets intermingled with other parts of the system resulting in highly coupled software which is complex and often unreliable. High coupling is also problematic from a human point of view. Building a web application with for example Rserve requires a web developer that is also an expert in R, Java and Rserve. Because R is a very domain specific language, this combination of skills is very rare and expensive.

1.1 Separation of concerns

What is needed to scale up embedded scientific computing is a system that decouples data analysis from other system components in such a way that applications can integrate statistical methods without detailed understanding of R or statistics. Component based software engineering advocates the design principle of separation of concerns (Heineman and Councill, 2001), which states that a computer program is split up into distinct pieces that each encapsulate a logical set of functionality behind a well-defined interface. This allows for independent development of various components by different people with different background and expertise.
Separation of concerns is fundamental to the functional programming paradigm (Reade, 1989) as well as the design of service oriented architectures on distributed information systems such as the internet (Fielding, 2000). The principle lies at the heart of this research and holds the key to advancing embedded scientific computing.

In order to develop a system that separates concerns of scientific computing from other parts of the system, we need to ask ourselves: what are the concerns of scientific computing? This question does not have a straightforward answer. Over the years, statistical software has gotten highly convoluted by the inclusion of complementary tools that are useful but not necessarily an integral part of computing. Separation of concerns requires us to extract the core logic and divorce it from all other apparatus. We need to form a conceptual model of data analysis that is independent of any particular application or implementation. Therefore, rather than discussing technical problems, this paper focuses entirely on studying the domain logic of the discipline along the lines of Evans (2003). By exploring the concepts, problems, and practices of the field we try to unveil the fundamental principles behind statistical software. Along the way we highlight important problems and bottlenecks that require further attention in order to facilitate reliable and scalable analysis modules.

The end goal of this paper is to work towards an interface definition for embedded scientific computing. An interface is the embodiment of separation of concerns and serves as a contract that formalizes the boundary across which separate components exchange information. The interface definition describes the concepts and operations that components agree upon to cooperate and how the communication is arranged.
In the interface we specify the functionality that a server has to implement, which parts of the interaction are fixed and which choices are specifically left at the discretion of the implementation. Ideally the specification should provide sufficient structure to develop clients and server components for scientific computing while minimizing limitations on how these can be implemented. An interface that carefully isolates components along the lines of domain logic allows developers to focus on their expertise using their tools of choice. It gives clients a universal point of interaction to integrate statistical programs without understanding the actual computing, and allows statisticians to implement their methods for use in applications without knowing specifics about the application layer.

1.2 The OpenCPU system

The OpenCPU system is an example that illustrates what an abstracted interface to scientific computing could look like. OpenCPU defines an HTTP API that builds on The R Project for Statistical Computing, for short: R (R Core Team, 2014). The R language is the obvious candidate for a first implementation of this kind. It is currently the most popular statistical software package and considered by many statisticians as the de facto standard of data analysis. The huge R community provides both the tools and use-cases needed to develop and experiment with this new approach to scientific computing. It is fair to say that currently only R has the required scale and foundations to really put our ideas to the test. However, although the research and OpenCPU system are colored by and tailored to the way things work in R, the approach should generalize quite naturally to other computational back-ends. The API is designed to describe general logic of data analysis rather than that of a particular language.
The main role of the software is to put this new approach into practice and get firsthand experience with the problems and opportunities in this unexplored field.

As part of the research, two OpenCPU server implementations were developed. The R package opencpu uses the httpuv web server (RStudio Inc., 2014a) to implement a single-user server which runs within an interactive R session on any platform. The cloud server on the other hand is a multi-user implementation based on Ubuntu Linux and rApache. The latter yields much better performance and has advanced security and configuration options, but requires a dedicated Linux server. Another major difference between these implementations is how they handle concurrency. Because R is single threaded, httpuv handles only a single request at a time. Additional incoming requests are automatically queued and executed in succession using the same process. The cloud server on the other hand takes advantage of multi-processing in the Apache2 web server to handle concurrency. This implementation uses forks of the R process to serve concurrent requests immediately with little performance overhead.

The differences between the cloud server and single user server are invisible to the client. The API provides a standard interface to either implementation and other than varying performance, applications will behave the same regardless of which server is used. This already hints at the benefits of a well defined interface.

1.3 History of OpenCPU

The OpenCPU system builds on several years of work dating back to 2009. The software evolved through many iterations of trial and error by which we identified the main concerns and learned what works in practice. Initial inspirations were drawn from recurring problems in developing R web applications with rApache, including van Buuren and Ooms (2009).
Accumulated experiences from these projects shaped a vision on what is involved with embedded scientific computing. After a year of internal development, the first public beta of OpenCPU appeared in August 2011. This version was picked up by early adopters in both industry and academia, some of which are still in production today. The problems and suggestions generated from early versions were a great source of feedback and revealed some fundamental problems. At the same time exciting developments were going on in the R community, in particular the rise of the RStudio IDE and introduction of powerful new R packages knitr, evaluate and httpuv. After a redesign of the API and a complete rewrite of the code, OpenCPU 1.0 was released in August 2013. By making better use of native features in HTTP, this version is more simple, flexible, and extensible than before. Subsequent releases within the 1.x series have introduced additional server configurations and optimizations without major changes to the API.

2 Practices and domain logic of scientific computing

This section provides a helicopter view of the practices and logic of scientific computing that are most relevant in the context of this research. The reader should get a sense of what is involved with scientific computing, what makes data analysis unique, and why the software landscape is dominated by domain specific languages (DSL). The topics are chosen and presented somewhat subjectively based on my experiences in this field. They are not intended to be exhaustive or exclusive, but rather identify the most important concerns for developing embedded analysis components.

2.1 It starts with data

The role and shape of data is the main characteristic that distinguishes scientific computing.
In most general purpose programming languages, data structures are instances of classes with well-defined fields and methods. Similarly, databases use schemas or table definitions to enforce the structure of data. This ensures that a table returned by a given SQL query always contains exactly the same structure with the requested fields; the only varying property between several executions of a query is the number of returned rows. Also the time needed for the database to process the request usually depends only on the number of records in the database. Strictly defined structures make it possible to write code implementing all required operations in advance without knowing the actual content of the data. It also creates a clear separation between developers and users. Most applications do not give users direct access to raw data. Developers focus on implementing code and designing data structures, whereas users merely get to execute a limited set of operations.

This paradigm does not work for scientific computing. Developers of statistical software have relatively little control over the structure, content, and quality of the data. Data analysis starts with the user supplying a dataset, which is rarely pretty. Real world data come in all shapes and formats. They are messy, have inconsistent structures, and invisible numeric properties. Therefore statistical programming languages define data structures relatively loosely and instead implement a rich lexicon for interactively manipulating and testing the data. Unlike software operating on well-defined data structures, it is nearly impossible to write code that accounts for any scenario and will work for every possible dataset. Many functions are not applicable to every instance of a particular class, or might behave differently based on dynamic properties such as size or dimensionality.
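As a minimal sketch of this last point, written in Python rather than R and using only the standard library, the helper below accepts any list of numbers, yet whether it can produce a variance depends on a dynamic property of the data (its size) rather than on its type. The name `describe` and the sample inputs are illustrative assumptions and not part of OpenCPU.

```python
import statistics

def describe(values):
    """Summarize a list of numbers; the result depends on how many there are."""
    summary = {"n": len(values), "mean": statistics.mean(values)}
    try:
        # The sample variance is only defined for two or more observations.
        summary["variance"] = statistics.variance(values)
    except statistics.StatisticsError as err:
        summary["variance"] = None
        summary["note"] = str(err)
    return summary

three_points = describe([2.0, 4.0, 6.0])   # variance can be computed
one_point = describe([2.0])                # same type, but too small for a variance
```

Both inputs have the same type, so a schema check alone would not reveal which call is able to return a variance.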
For these reasons there is also a less clear separation between developers and users in scientific computing. The data analysis process involves simultaneous debugging of code and data where the user iterates back and forth between manipulating and analyzing the data. Implementations of statistical methods tend to be very flexible with many parameters and settings to specify behavior for the broad range of possible data. And still the user might have to go through many steps of cleaning and reshaping to give data the appropriate structure and properties to perform a particular analysis.

Informal operations and loosely defined data structures are typical characteristics of scientific computing. They give a lot of freedom to implement powerful and flexible tools for data analysis, but complicate interfacing of statistical methods. Embedded systems require a degree of type-safety, predictability, and consistency to facilitate reliable I/O between components. These features are native to databases or many object oriented languages, but require substantial effort for statistical software.

2.2 Functional programming

Many different programming languages and styles exist, each with their own strengths and limitations. Scientific computing languages typically use a functional style of programming, where methods take a role and notation similar to functions in mathematics. This has obvious benefits for numerical computing. Because equations are typically written as y = f(g(x)) (rather than y = x.g().f() notation), a functional syntax results in intuitive code for implementing algorithms. Most popular general purpose languages take a more imperative and object oriented approach. In many ways, object-oriented programming can be considered the opposite of functional programming (A. M. Kuchling, 2014).
Here methods are invoked on an object and modify the state of this particular object. Object-oriented languages typically implement inheritance of fields and methods based on object classes or prototypes. Many software engineers prefer this style of programming because it is more powerful to handle complex data structures. The success of object oriented languages has also influenced scientific computing, resulting in multi-paradigm systems. Languages such as Julia and R use multiple dispatch to dynamically assign function calls to a particular function based on the type of arguments. This brings certain object oriented benefits to functional languages, but also complicates scoping and inheritance.

A comparative review on programming styles is beyond the scope of this research. But what is relevant to us is how conflicting paradigms affect interfacing of analysis components. In the context of web services, the Representational State Transfer style (for short: REST) described by Fielding (2000) is very popular among web developers. A restful API maps every URL to a resource and HTTP requests are used to modify the state of a resource, which results in a simple and elegant API. Unfortunately, REST does not map very naturally to the functional paradigm of statistical software. Languages where functions are first class citizens suggest more RPC flavored interfaces, which according to Fielding are by definition not restful (Fielding, 2008). This does not mean that such a component is incompatible with other pieces. As long as components honor the rules of the protocol (i.e. HTTP) they will work together. However, conflicting programming styles can be a source of friction for embedded scientific computing.
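The two notations contrasted in this section can be put side by side in a short Python sketch. The functions `f` and `g` and the class `Value` are hypothetical and chosen only for illustration. The functional form returns new values and leaves its input untouched, whereas the method-chaining form modifies the state of the object it is invoked on.

```python
import math

# Functional style: plain functions compose like y = f(g(x)).
def g(x):
    return x * x

def f(x):
    return math.sqrt(x + 1)

x = 3.0
y_functional = f(g(x))          # x itself is not modified

# Object-oriented style: methods are invoked on an object and change its state.
class Value:
    def __init__(self, x):
        self.x = x

    def g(self):
        self.x = self.x * self.x
        return self             # returning self allows chaining: v.g().f()

    def f(self):
        self.x = math.sqrt(self.x + 1)
        return self

v = Value(3.0)
y_object = v.g().f().x          # v now holds the transformed value
```

Both forms compute the same number; they differ in where the state lives, which is the kind of stylistic mismatch that can cause friction when components are combined.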
Strongly object-oriented frameworks or developers might require some additional effort to get comfortable with components implementing a more functional paradigm.

2.3 Graphics

Another somewhat domain specific feature of scientific computing is native support for graphics. Most statistical software packages include programs to draw plots and charts in some form or another. In contrast to data and functions which are language objects, the graphics device is considered a separate output stream. Drawings on the canvas are implemented as a side effect rather than a return value of function calls. This works a bit similar to manipulating document object model (DOM) elements in a browser using JavaScript. In most interactive statistical software, graphics appear to the user in a new window. The state of the graphics device cannot easily be stored or serialized as is the case for functions and objects. We can export an image of the graphics device to a file using png, svg or pdf format, but this image is merely a snapshot. It does not contain the actual state of the device and cannot be reloaded for further manipulation.

First class citizenship of graphics is an important concern of interfacing scientific computing. Yet output containing both data and graphics makes the design of a general purpose API more difficult. The system needs to capture the return value as well as graphical side effects of a remote function call. Furthermore the interface should allow for generating graphics without imposing restrictions on the format or formatting parameters. Users want to utilize a simple bitmap format such as png for previewing a graphic, but have the option to export the same graphic to a high quality vector based format such as pdf for publication.
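One way to picture a graphics device that does not depend on the image format is a device that records drawing operations rather than pixels. The following Python sketch is a toy under stated assumptions, not how R or OpenCPU implement graphics: `RecordingDevice`, `to_svg` and `to_text` are hypothetical names. It captures both the return value and the graphical side effect of a single call, then exports the recorded plot in two formats without repeating the random computation.

```python
import random

class RecordingDevice:
    """Hypothetical graphics device that records drawing calls instead of rendering."""
    def __init__(self):
        self.ops = []

    def line(self, x1, y1, x2, y2):
        self.ops.append((x1, y1, x2, y2))

def to_svg(device, width=100, height=100):
    """Render the recorded operations as an SVG document."""
    body = "".join(
        f'<line x1="{x1}" y1="{y1}" x2="{x2}" y2="{y2}" stroke="black"/>'
        for x1, y1, x2, y2 in device.ops
    )
    return (f'<svg xmlns="http://www.w3.org/2000/svg" '
            f'width="{width}" height="{height}">{body}</svg>')

def to_text(device):
    """Render the same operations in a plain text format."""
    return "\n".join(f"line {x1},{y1} -> {x2},{y2}" for x1, y1, x2, y2 in device.ops)

def analysis(device):
    """Return a value and, as a side effect, draw three bars on the device."""
    heights = [random.randint(0, 100) for _ in range(3)]   # non-deterministic
    for i, h in enumerate(heights):
        device.line(i * 10, 0, i * 10, h)
    return sum(heights) / len(heights)

device = RecordingDevice()
value = analysis(device)   # the random computation runs exactly once
svg = to_svg(device)       # both exports describe the same recorded plot
txt = to_text(device)
```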
Because statistical computation is expensive and non-deterministic, the graphic cannot simply be reconstructed from scratch only to retrieve it in another format. Hence the API needs to incorporate the notion of a graphics device in a way independent of the imaging format.

2.4 Numeric properties and missing values

It was already mentioned how loosely defined data structures in scientific computing can impede type safety of data I/O in analysis components. In addition, statistical methods can choke on the actual content of data as well. Sometimes problematic data can easily be spotted, but often it is nearly impossible to detect these ahead of time. Applying statistical procedures to these data will then result in errors, even though the code and structure of the data are perfectly fine. These problems frequently arise for statistical models that build on matrix decompositions which require the data to follow certain numeric properties. The rank of a matrix is one such property which measures the nondegenerateness of the system of linear equations. When a matrix A is rank deficient, the equation Ax = b does not have a solution when b does not lie in the range of A. Attempting to solve this equation will eventually lead to division by zero. Accounting for such cases ahead of time is nearly impossible because numeric properties are invisible until they are actually calculated. But perhaps just as difficult is explaining to the user or software engineer that these errors are not a bug, and cannot be fixed. The procedure just does not work for this particular dataset.

Another case of problematic data is presented by missing values. Missingness in statistics means that the value of a field is unknown. Missing data should not be confused with no data or null.
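The distinction can be made concrete with a small Python sketch. Python has no dedicated NA primitive, so the floating point NaN is used here as a stand-in for a missing value; this is an assumption of the example (R, for instance, distinguishes NA from NaN). The variable names are illustrative.

```python
import math

NA = float("nan")            # stand-in for "a value exists but is unknown"

responses = [3.0, NA, 5.0]   # three respondents, one answer is missing
no_data = []                 # nobody was asked: there are no observations
null = None                  # there is no dataset object at all

n_missing = sum(math.isnan(x) for x in responses)

# NaN propagates through arithmetic, so a naive mean is itself missing.
naive_mean = sum(responses) / len(responses)

# Dropping missing values yields a number, but silently changes the sample.
observed = [x for x in responses if not math.isnan(x)]
observed_mean = sum(observed) / len(observed)
```

The dataset with a missing value still has three records, which is what separates it from an empty or absent dataset.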
Missing values are often non ignorable, meaning that the missingness itself is information that needs to be accounted for in the modeling. A standard textbook example is when we perform a survey asking people about their salary. Because some people might refuse to provide this information, the data contains missing values. This missingness is probably not completely at random: respondents with high salaries might be more reluctant to provide this information than respondents with a median salary. If we calculate the mean salary from our data ignoring the missing values, the estimate is likely biased. To obtain a more accurate estimate of the average salary, missing values need to be incorporated in the estimation using a more sophisticated model.

Statistical programming languages can define several types of missing or non-finite values such as NA, NaN or Inf. These are usually implemented as special primitives, which is one of the benefits of using a DSL. Functions in statistical software have built-in procedures and options to specify how to handle missing values encountered in the data. However, the notion of missingness is foreign to most languages and software outside of scientific computing. Missing values are a typical domain-specific phenomenon that can cause technical problems in data exchange with other systems. And like numeric properties, the concept of values containing no actual value is likely to cause confusion among developers or users with limited experience in data analysis. Yet failure to properly incorporate missing values in the data can easily lead to errors or incorrect results, as the example above illustrated.

2.5 Non-deterministic and unpredictable behavior

Most software applications are expected to produce consistent output in a timely manner, unless something is very wrong. This does not generally hold for scientific computing.
The previous section explained how problematic data can cause exceptions or unexpected results. But many analysis methods are actually non-deterministic or unpredictable by nature.

Statistical algorithms often repeat some calculation until a particular convergence criterion is reached. Starting values and minor fluctuations in the data can have a snowball effect on the course of the algorithm. Therefore several runs can result in wildly varying outcomes and completion times. Moreover, convergence might not be guaranteed: unfortunate input can get a process stuck in a local minimum or send it off into the wrong direction. Predicting and controlling for such scenarios a priori in the implementation is very difficult. Monte Carlo techniques are even less predictable because they are specifically designed to behave randomly. For example, MCMC methods use a Markov chain to simulate a random walk through a (high-dimensional) space such as a multivariate probability density. These methods are a powerful tool for simulation studies and numerical integration in Bayesian analysis. Each execution of the random walk yields different outcomes, but under general conditions the process will converge to the value of interest. However, due to randomness it is possible that some of the runs or chains get stuck and need to be terminated or disregarded.

Unpredictability of statistical methods underlies many technical problems for embedded scientific computing that are not present when interacting with a database. This can sometimes surprise software engineers expecting deterministic behavior. Statistical methods are rarely absolutely guaranteed to be successful for arbitrary data. Assuming that a procedure will always return timely and consistently because it did so with testing data is very dangerous.
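A small Python sketch can illustrate how starting values steer an iterative procedure. The function `descend`, its objective f(x) = x^4 - 3x^2 + x, and the step size and tolerance are all assumptions chosen for illustration; they do not come from any particular statistical package. The same code converges to different local minima depending on where it starts, and reports non-convergence when the iteration budget runs out.

```python
def descend(x, step=0.01, tol=1e-8, max_iter=10_000):
    """Gradient descent on f(x) = x**4 - 3*x**2 + x.

    Returns (final x, iterations used, whether the criterion was met).
    """
    for i in range(1, max_iter + 1):
        grad = 4 * x**3 - 6 * x + 1
        x_new = x - step * grad
        if abs(x_new - x) < tol:      # convergence criterion
            return x_new, i, True
        x = x_new
    return x, max_iter, False         # budget exhausted without converging

left = descend(-2.0)                  # ends in the minimum near x = -1.30
right = descend(2.0)                  # ends in a different minimum near x = 1.13
capped = descend(2.0, max_iter=5)     # same start, but terminated early
```

A caller that only tested one starting value would have no indication that another input leads to a different answer, or to no answer within the allowed number of iterations.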
In a console, the user can easily intervene or recover, and retry with different options or starting values. For embedded modules, unpredictability needs to be accounted for in the design of the system. At a very minimum, the system should be able to detect and terminate a process that has not completed when some timeout is reached. But preferably we need a layer or meta functionality to control and monitor executions, either manually or automatically.

2.6 Managing experimental software

In scientific computing, we usually need to work with inventive, volatile, and experimental software. This is a big cultural difference with many general purpose languages such as Python, Java, C++ or JavaScript. The latter communities include professional organizations and engineers committed to implementing and maintaining production quality libraries. Most authors of open source statistical software do not have the expertise and resources to meet such standards. Contributed code in languages like R was often written by academics or students to accompany a scientific article proposing novel models, algorithms, or programming techniques. The script or package serves as an illustration of the presented ideas, but needs to be tweaked and tailored to fit a particular problem or dataset. The quality of such contributions varies a lot, and no active support or maintenance should be expected from the authors. Furthermore, package updates can sometimes introduce radical changes based on new insights.

Because traditional data analysis does not really have a notion of production, this has never been a major problem. The emphasis in statistical software has always been on innovation rather than continuity. Experimental code is usually good enough for interactive data analysis where it suffices to manually make a script or package work for the dataset at hand.
Authors of statistical software tend to assume that the user will spend some effort to manage dependencies and debug the code. However, integrated components require a greater degree of reliability and continuity which introduces a source of technical and cultural friction for embedded scientific computing. This makes the ability to manage unstable software, facilitate rapid change, sandbox modules, and manage failure important concerns of embedded scientific computing.

2.7 Interactivity and error handling

In general purpose languages, run-time errors are typically caused by a bug or some sort of system failure. Exceptions are only raised when the software cannot recover and usually result in termination of the process. Error messages contain information such as calling stacks to help the programmer discover where in the code a problem occurred. Software engineers go through great trouble to prevent potential problems ahead of time using smart compilers, unit tests, automatic code analysis, and continuous integration. Errors that do arise during production are usually not displayed to the user, but rather the administrator is notified that the system urgently needs attention. The user gets to see an apology at best.

In scientific computing, errors play a very different role. As a consequence of some of the characteristics discussed earlier, interactive debugging is a natural part of the user experience. Errors in statistics do not necessarily indicate a bug in the software, but rather a problem with the data or some interaction of the code and data. The statistician goes back and forth between cleaning, manipulating, modeling, visualizing and interpreting to study patterns and relations in the data. This simultaneous debugging of data and code comes down to a lot of trial and error.
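A hedged sketch of what this different role of errors could mean for an embedding system: rather than treating an exception as a crash, the wrapper below returns the error message as part of the result so that it can be shown to the analyst. The names `run_analysis` and `slope` are hypothetical, and the example is written in Python rather than R.

```python
def slope(x, y):
    """Least-squares slope of y on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx   # undefined when x has no variation

def run_analysis(func, *args):
    """Call an analysis function and report failures as data instead of crashing."""
    try:
        return {"status": "ok", "value": func(*args)}
    except Exception as err:   # broad on purpose: every failure is reported back
        return {"status": "error", "message": f"{type(err).__name__}: {err}"}

good = run_analysis(slope, [1, 2, 3], [2, 4, 6])
bad = run_analysis(slope, [5, 5, 5], [2, 4, 6])   # constant predictor
```

The second call fails because of the data, not because of a bug in `slope`; the returned message is what tells the analyst that the predictor has no variation.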
Problems with outliers, degrees of freedom or numeric properties do not reveal themselves until we try to fit a model or create a plot. Exceptions raised by statistical methods are often a sign that data needs additional work. This makes error messages an important source of information for the statistician to get to know a dataset and its intricacies. And while debugging the data we learn limitations of the analysis methods. In practice we sometimes find out that a particular dataset requires us to research or implement additional techniques because the standard tools do not suffice or are inappropriate. Interactive error handling is one of the reasons that there is no clear distinction between development and production in scientific computing.

When interfacing with analysis modules it is important that the role of errors is recognized. An API must be able to handle exceptions and report error messages to the user, and certainly not crash the system. The role of errors and interactive debugging in data analysis can be confusing to developers outside of our community. Some popular commercial products seem to have propagated the belief that data analysis comes down to applying a magical formula to a dataset, and no intelligent action is required on the side of the user. Systems that only support such canned analyses don’t do justice to the wide range of methods that statistics has to offer. In practice, interactive data debugging is an important concern of data analysis and embedded scientific computing.

2.8 Security and resource control

Somewhat related to the above are special needs in terms of security. Most statistical software currently available is primarily designed for interactive use on the local machine. Therefore access control is not considered an issue and the execution environment is entirely unrestricted.
Embedded modules or public services require implementation of security policies to prevent malicious or excessive use of resources. This in itself is not a unique problem. Most scripting languages such as php or python do not enforce any access control and assume security will be implemented on the application level. But in the case of scientific computing, two domain specific aspects further complicate the problem.

The first issue is that statistical software can be demanding and greedy with hardware resources. Numerical methods are expensive both in terms of memory and cpu. Fair-use policies are not really feasible because excessive use of resources often happens unintentionally. For example, an overly complex model specification or algorithm getting stuck could end up consuming all available memory and cpu until manually terminated. When this happens on the local machine, the user can easily interrupt the process prematurely by sending a SIGINT (pressing CTRL+C or ESC), but in a shared environment this needs to be regulated by the system. Embedded scientific computing requires technology and policies that can manage and limit memory allocation, cycles, disk space, concurrent processes, network traffic, etc. The degree of flexibility offered by implementation of resource management is an important factor in the scalability of a system. Fine grained control over system resources consumed by individual tasks allows for serving many users without sacrificing reliability.

The second domain specific security issue is caused by the need for arbitrary code execution. A traditional application security model is based on user role privileges. In a standard web application, only a developer or administrator can implement and deploy actual code.
The application merely exposes predefined functionality; users are not allowed to execute arbitrary code on the server. Any possibility of code injection is considered a security vulnerability and when found the server is potentially compromised. However as already mentioned, lack of segregation between users and developers in statistics gives limited use to applications that restrict users to predefined scripts and canned services. To support actual data analysis, the user needs access to the full language lexicon to freely explore and manipulate the data. The need for arbitrary code execution disqualifies user role based privileges and demands a more sophisticated security model.

2.9 Reproducible research

Replication of findings is one of the main principles of the scientific method. In quantitative research, a necessary condition for replication is reproducibility of results. The goal of reproducible research is to tie specific instructions to data analysis and experimental data so that scholarship can be recreated, better understood, and verified (Kuhn, 2014). Even though the ideas of replication are as old as science itself, reproducibility in scientific computing is still in its infancy. Tools are available that assist users in documenting their actions, but to most researchers these are not a natural part of the daily workflow. Fortunately, the importance of replication in data analysis is increasingly acknowledged and support for reproducibility is becoming more influential in the design of statistical software.

Reproducibility changes what constitutes the main product of data analysis. Rather than solely output and conclusions, we are interested in recording and publishing the entire analysis process. This includes all data, code and results but also external software that was used to arrive at the results.
Reproducibility puts high requirements on software versioning. More than in other fields it is crucial that we diligently archive and administer the precise versions or branches of all scripts, packages, libraries, plugins that were somehow involved in the process. If an analysis involves randomness, it is also important that we keep track of which seeds and random number generators were used. In the current design of statistical software, reproducibility was mostly an afterthought and has to be taken care of manually. In practice it is tedious and error-prone. There is a lot of room for improvement through software that incorporates reproducible practices as a natural part of the data analysis process.

Whereas reproducibility in statistics is acknowledged from a transparency and accountability point of view, it has enormous potential to become much more than that. There are interesting parallels between reproducible research and revision control in source code management systems. Technology for automatic reproducible data analysis could revolutionize scientific collaboration, similar to what git has done for software development. A system that keeps track of each step in the analysis process like a commit in software versioning would make peer review or follow-up analysis more practical and enjoyable. When colleagues or reviewers can easily reproduce results, test alternative hypotheses or recycle data, we achieve greater trustworthiness but also multiply return on investment of our work. Finally an open kitchen can help facilitate more natural ways of learning and teaching statistics.
Rather than relying on general purpose textbooks with artificial examples, scholars directly study the practices of prominent researchers to understand how methods are applied in the context of data and problems as they appear specifically in their area of interest.

3 The state problem

Management of state is a fundamental principle around which digital communications are designed. We distinguish stateful and stateless communication. In a stateless communication protocol, interaction involves independent request-response messages in which each request is unrelated to any previous request (Hennessy and Patterson, 2011). Because the messages are independent, there is no particular ordering to them and requests can be performed concurrently. Examples of stateless protocols include the internet protocol (IP) and the hypertext transfer protocol (HTTP). A stateful protocol on the other hand consists of an interaction via an ordered sequence of interrelated messages. The specification typically prescribes a specific mechanism for initiating and terminating a persistent connection for information exchange. Examples of stateful protocols include the transmission control protocol (TCP) or file transfer protocol (FTP).

In most data analysis software, the user controls an interactive session through a console or GUI, with the possibility of executing a sequence of operations in the form of a script. Scripts are useful for publishing code, but the most powerful way of using the software is interactively. In this respect, statistical software is not unlike a shell interface to the operating system. Interactivity in scientific computing makes management of state the most central challenge in the interface design. When moving from a UI to API perspective, support for statefulness becomes substantially more complicated.
This section discusses how the existing bridges to R have approached this problem, and their limitations. We then continue by explaining how the OpenCPU API exploits the functional paradigm to implement a hybrid solution that abstracts the notion of state and allows for a high degree of performance optimization.

3.1 Stateless solutions: predefined scripts

The easiest solution is to not incorporate state on the level of the interface, and limit the system to predefined scripts. This is the standard approach in traditional web development. The web server exposes a parameterized service which generates dynamic content by calling out to a script on the system via CGI. Any support for state has to be implemented manually in the application layer, e.g. by writing code that stores values in a database. For R we can use rApache (Horner, 2013) to develop this kind of scripted application, very similar to web scripting languages such as php. This suffices for relatively simple services that expose limited, predefined functionality. Scripted solutions give the developer flexibility to freely define input and output that are needed for a particular application. For example, we can write a script that generates a plot based on a couple of input parameters and returns a fixed size png image. Because scripts are stateless, multiple requests can be performed concurrently. A lot of the early work in this research has been based on this approach, which is a nice starting point but becomes increasingly problematic for more sophisticated applications.

The main limitation of scripts is that to support basic interactivity, retention of state needs to be implemented manually in the application layer. A minimal application in statistics consists of the user uploading a data file, performing some manipulations and then creating a model, plot or report.
When using scripts, the application developer needs to implement a framework to manage requests from various user sessions, and store intermediate results in a database or on disk. Due to the complexity of objects and data in R, this is much more involved than it is in e.g. php, and requires programming skills. Furthermore it leads to code that intermingles scientific computing with application logic, and rapidly increases complexity as the application gets extended with additional scripts. Because these problems will recur for almost any statistical application, we could benefit greatly from a system that supports retaining state by design.

Moreover predefined scripts are problematic because they divide developers and users in a way that is not very natural for scientific computing. Scripts in traditional web development give the client very little power, in order to prevent malicious use of services. However, in scientific computing, a script often merely serves as a starting point for analysis. The user wants to be able to modify the script to look at the data in another way by trying additional methods or different procedures. A system that only allows for performing scripted actions severely handicaps the client and creates a lot of work for developers: because all functionality has to be prescripted, they are in charge of designing and implementing each possible action the user might want to perform. This is impractical for statistics because of the infinite number of operations that can be performed on a dataset. For these reasons, the stateless scripting approach does not scale well to many users or complex applications.

3.2 Stateful solution: client side process management

Most existing bridges to R have taken a stateful approach.
Tools such as Rserve (Urbanek, 2013b) and shiny (RStudio Inc., 2014b) expose a low-level interface to a private R process over a (web)socket. This gives clients freedom to run arbitrary R code, which is great for implementing something like a web-based console or IDE. The main problem with existing stateful solutions is lack of interoperability. Because these tools are in essence a remote R console, they do not specify any standardized interface for calling methods, data I/O, etc. A low-level interface requires extensive knowledge of logic and internals of R to communicate, which again leads to high coupling. The client needs to be aware of R syntax to call R methods, interpret R data structures, capture graphics, etc. These bridges are typically intended to be used in combination with a special client. In the case of shiny, the server comes with a set of widget templates that can be customized from within R. This allows R users to create a basic web GUI without writing any HTML or JavaScript, which can be very useful. However, the shiny software is not designed for integration with non-shiny clients and serves a somewhat different purpose and audience than tools for embedded scientific computing.

Besides high coupling and lack of interoperability, stateful bridges also introduce some technical difficulties. Systems that allocate a private R process for each client cannot support concurrent requests within a session. Each incoming request has to wait until the previous requests are finished for the process to become available. In addition to suboptimal performance, this can also be a source of instability. When the R process gets stuck or raises an unexpected error, the server might become unresponsive causing the application to crash.
Another drawback is that stateful servers are extremely expensive and inefficient in terms of memory allocation. The server has to keep each R process alive for the full duration of a session, even when idle most of the time. Memory that is in use by any single client does not free up until the user closes the application. This is particularly unfortunate because memory is usually the main bottleneck in data intensive applications of scientific computing. Moreover, connectivity problems or ill-behaved clients require mechanisms to timeout and terminate inactive processes, or save and restore an entire session.

3.3 A hybrid solution: functional state

We can take the best of both worlds by abstracting the notion of state to a higher level. Interactivity and state in OpenCPU are provided through persistence of objects rather than a persistent process. As it turns out, this is a natural and powerful definition of state within the functional paradigm. Functional programming emphasizes that output from methods depends only on their inputs and not on the program state. Therefore, functional languages can support state without keeping an entire process alive: merely retaining the state of objects should be sufficient. As was discussed before, this has obvious parallels with mathematics, but also maps beautifully to stateless protocols such as HTTP. The notion of state as the set of objects is already quite natural to the R user, as is apparent from the save.image function. This function serializes all objects in the global environment to a file on disk, which is described in the documentation as “saving the current workspace”. Exploiting this same notion of state in our interface allows us to get the benefits of both traditional stateless and stateful approaches without introducing additional complexity.
This simple observation provides the basis for a very flexible, stateful RPC system. To facilitate this, the OpenCPU API defines a mapping between HTTP requests and R function calls. After executing a function call, the server stores all outputs (return value, graphics, files) and a temporary key is given to the client. This key can be used to control these newly created resources in future requests. The client can retrieve objects and graphics in various formats, publish resources, or use them as arguments in subsequent function calls. An interactive application consists of a series of RPC requests with keys referencing the objects to be reused as arguments in consecutive function calls, making the individual requests technically stateless. Besides reduced complexity, this system makes parallel computing and asynchronous requests a natural part of the interaction. To compute f(g(x), h(y)), the client can perform RPC requests for g(x) and h(y) simultaneously and pass the resulting keys to f() in a second step. In an asynchronous client language such as JavaScript this happens so naturally that it requires almost no effort from the user or application developer.

One important detail is that OpenCPU deliberately does not prescribe how the server should implement storing and loading of objects in between requests. The API only specifies a system for performing R function calls over HTTP and referencing objects from keys. Different server implementations can use different strategies for retaining such objects. A naive implementation could simply serialize objects to disk after each request and immediately terminate the process. This is safe and easy, but writing to disk can be slow.
A more sophisticated implementation could keep objects in memory for a while longer, either by keeping the R process alive or through some sort of in-memory database or memcached system. Thereby the resources do not need to be loaded from disk if they are reused in a subsequent request shortly after being created. This illustrates the kind of optimization that can be achieved by carefully decoupling server and client components.

4 The OpenCPU HTTP API

This section introduces the most important concepts and operations of the API. At this point the concerns discussed in earlier chapters become more concrete as we illustrate how the pieces come together in the context of R and HTTP. It is not the intention to provide a detailed specification of every feature of the system. We focus on the main parts of the interface that exemplify the separation of concerns central to this work. The online documentation and reference implementations are the best source of information on the specifics of implementing clients and applications.

4.1 About HTTP

One of the major strengths of OpenCPU is that it builds on the hypertext transfer protocol (Fielding et al., 1999). HTTP is the most used application protocol on the internet, and the foundation of data communication in browsers and the world wide web. The HTTP specification is very mature and widely implemented. It provides all functionality required to build modern applications and has recently gained popularity for web APIs as well. The benefit of using a standardized application protocol is that a lot of functionality gets built-in by design. HTTP has excellent mechanisms for authentication, encryption, caching, distribution, concurrency, error handling, etc. This allows us to defer most application logic of our system to the protocol and limit the API specification to logic of scientific computing.
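Before turning to the HTTP mapping itself, the key-based interaction from Section 3.3 can be sketched without any network code. The toy model below is written in Python purely for illustration: the class ToyServer, the Key type and the functions f, g and h are hypothetical stand-ins and not part of OpenCPU. It shows how g(x) and h(y) can be requested concurrently and how the resulting keys are passed to f() in a second step, while stored objects are never overwritten.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Key:
    """Hypothetical temporary key handed back to the client after an RPC."""
    value: str


class ToyServer:
    """In-memory stand-in for a server that persists RPC outputs under keys."""

    def __init__(self):
        self._store = {}

    def rpc(self, func, *args):
        # Arguments that are keys are resolved to the stored objects;
        # everything else is passed to the function as a plain value.
        resolved = [self._store[a] if isinstance(a, Key) else a for a in args]
        # Every call stores its output under a fresh key, so existing
        # objects are never modified.
        key = Key("x" + uuid.uuid4().hex[:10])
        self._store[key] = func(*resolved)
        return key

    def get(self, key):
        return self._store[key]


def g(x):
    return x + 1


def h(y):
    return y * 2


def f(a, b):
    return a + b


server = ToyServer()

# g(x) and h(y) are independent, so they can be requested at the same time.
with ThreadPoolExecutor(max_workers=2) as pool:
    future_g = pool.submit(server.rpc, g, 3)
    future_h = pool.submit(server.rpc, h, 5)
    key_g, key_h = future_g.result(), future_h.result()

# Second step: pass the two keys as arguments to f().
key_f = server.rpc(f, key_g, key_h)
result = server.get(key_f)
```

Each call to rpc() depends only on its arguments and the stored objects they reference, which is what makes the individual requests stateless from the point of view of the protocol.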
The OpenCPU API defines a mapping between HTTP requests and high-level operations such as calling functions, running scripts, access to data, manual pages and management of files and objects. The API deliberately does not prescribe any language implementation details. Syntax and low-level concerns such as process management or code evaluation are abstracted and at the discretion of the server implementation. The API also does not describe any logic which can be taken care of on the protocol or application layer. For example, to add support for authentication, any of the standard mechanisms can be used such as basic auth (Franks et al., 1999) or OAuth 2.0 (Hardt, 2012). The implementation of such authentication methods might vary from a simple server configuration to defining additional endpoints. But because authentication will not affect the meaning of the API itself, it can be considered independent of this research. The same holds for other features of the HTTP protocol which can be used in conjunction with the OpenCPU API (or any other HTTP interface for that matter). What remains after cutting out implementation and application logic is a simple and interoperable interface that is easy to understand and can be implemented with standard HTTP software libraries. This is an enormous advantage over many other bridges to R and critical to make the system scalable and extensible.

4.2 Resource types

As was described earlier, individual requests within the OpenCPU API are stateless and there is no notion of a process. State of the system changes through creation and manipulation of resources. This makes the various resource types the conceptual building blocks of the API. Each resource type has unique properties and supports different operations.
4.2.1 Objects

Objects are the main entities of the system and carry the same meaning as within a functional language. They include data structures, functions, or other types supported by the back-end language, in this case R. Each object has an individual endpoint within the API and a unique name or key within its namespace. The client needs no knowledge of the implementation of these objects. Analogous to a UI, the primary purpose of the API is managing objects (creating, retrieving, publishing) and performing procedure calls. Objects created from executing a script or returned by a function call are automatically stored and gain the same status as other existing objects. The API does not distinguish between static objects that appear in e.g. packages, or dynamic objects created by users, nor does it distinguish between objects in memory or on disk. The API merely provides a system for referencing objects in a way that allows clients to control and reuse them. The implementation of persistence, caching and expiration of objects is at the discretion of the server.

4.2.2 Namespaces

A namespace is a collection of uniquely named objects with a given path in the API. In R, static namespaces are implemented using packages and dynamic namespaces exist in environments such as the user workspace. OpenCPU abstracts the concept of a namespace as a set of uniquely named objects and does not distinguish between static, dynamic, persistent or temporary namespaces. Clients can request a list of the contents of any namespace, yet the server might refuse such a request for private namespaces or hidden objects.

4.2.3 Formats

OpenCPU explicitly differentiates a resource from a representation of that resource in a particular format. The API lets the client rather than the server decide on the format used to serve content.
This is a difference with common scientific practices of exchanging data, documents and figures in fixed format files. Resources in OpenCPU can be retrieved using various output formats and formatting parameters. For example, a basic dataset can be retrieved in csv, json, Protocol Buffers or tab delimited format. Similarly, a graphic can be retrieved in svg, png or pdf and manual pages can be retrieved in text, html or pdf format. In addition to the format, the client can specify formatting parameters in the request. The system supports many additional formats, but not every format is appropriate for every resource type. When a client requests a resource in an invalid format, the server responds with an error.

4.2.4 Data

The API defines a separate entity for data objects. Even though data can technically be treated as general objects, they often serve a different purpose. Data are usually not language specific and cannot be called or executed. Therefore it can be useful to conceptually distinguish this subclass. For example, R uses lazy loading of data objects to save memory for packages containing large datasets.

4.2.5 Graphics

Any function call can produce zero or more graphics. After completing a remote function call, the server reports how many graphics were created and provides the key for referencing these graphics. Clients can retrieve each individual graphic in subsequent requests using one of various output formats such as png, pdf, and svg. Where appropriate the client can specify additional formatting parameters during the retrieval of the graphic such as width, height or font size.

4.2.6 Files

Files can be uploaded and downloaded using standard HTTP mechanics. The client can post a file as an argument in a remote function call, or download files that were saved to the working directory by the function call.
Support for files also allows for hosting web pages (e.g. html, css, js) that interact with local API endpoints to serve a web application. Furthermore files that are recognized as scripts can be executed using RPC.

4.2.7 Manuals

In most scientific computing languages, each function or dataset that is available to the user is accompanied by an identically named manual page. This manual page includes information such as description and usage of functions and their arguments, or comments about the columns of a particular dataset. Manual pages can be retrieved through the API in various formats including text, html and pdf.

4.2.8 Sources

The OpenCPU specification makes reproducibility an integrated part of the API interaction. In addition to results, the server stores the call and arguments for each RPC request. The same key that is used to retrieve objects or graphics can be used to retrieve sources or automatically replicate the computation. Hence for each output resource on the system, clients can look up the code, data, warnings and packages that were involved in its creation. Thereby results can easily be recalculated, which forms a powerful basis for reproducible practices. This feature can be used for other purposes as well. For example, if a function fetches dynamic data from an external resource to generate a model or plot, reproduction can be used to update the model or plot with new data.

4.2.9 Containers

We refer to a path on the server containing one or more collections of resources as a container. The current version of OpenCPU implements two types of containers. A package is a static container which may include a namespace with R objects, manual pages, data and files. A session is a dynamic container which holds outputs created from executing a script or function call, including a namespace with R objects, graphics and files.
The distinction between packages and sessions is an implementation detail. The API does not differentiate between the various container types: interacting with an object or file works the same, regardless of whether it is part of a package or session. Future versions or other servers might implement different container types for grouping collections of resources.

4.2.10 Libraries

We refer to a collection of containers as a library. In R terminology, a library is a directory on disk with installed packages. Within the context of the API, the concept is not limited to packages but refers more generally to any set of containers. The /ocpu/tmp/ library for example is the collection of temporary sessions. Also the API notion of a library does not require containers to be preinstalled. A remote collection of packages, which in R terminology is called a repository, can also be implemented as a library. The current implementation of OpenCPU exposes the /ocpu/cran/ library which refers to the current packages on the CRAN repository. The API does not differentiate between a library of sessions, local packages or remote packages. Interacting with an object from a CRAN package works the same as interacting with an object from a local package or temporary session. The API leaves it up to the server which types of libraries it wishes to expose and how to implement this. The current version of OpenCPU uses a combination of cron-jobs and on-the-fly package installations to synchronize packages on the server with the CRAN repositories.

4.3 Methods

The current API uses two HTTP methods: GET and POST. As per HTTP standards, GET is a safe method which means it is intended only for information reading and should not change the state of the server. OpenCPU uses the GET method to retrieve objects, manuals, graphics or files.
The parameters of the request are mapped to the formatting function. A GET request targeting a container, namespace or directory is used to list the contents. The POST method on the other hand is used for RPC, which does change server state. A POST request targeting a function results in a remote function call where the HTTP parameters are mapped to function arguments. A POST request targeting a script results in an execution of the script where HTTP parameters are mapped to the script interpreter. Table 1 gives an overview using the MASS package (Venables and Ripley, 2002) as an example.

| Method | Target | Action | Parameters | Example |
| --- | --- | --- | --- | --- |
| GET | object | retrieve | formatting | GET /ocpu/library/MASS/data/cats/json |
| GET | manual | read | formatting | GET /ocpu/library/MASS/man/rlm/html |
| GET | graphic | render | formatting | GET /ocpu/tmp/{key}/graphics/1/png |
| GET | file | download | - | GET /ocpu/library/MASS/NEWS |
| GET | path | list contents | - | GET /ocpu/library/MASS/scripts/ |
| POST | object | call function | function arguments | POST /ocpu/library/stats/R/rnorm |
| POST | file | run script | control interpreter | POST /ocpu/library/MASS/scripts/ch01.R |

Table 1: Currently implemented HTTP methods

4.4 Status codes

Each HTTP response includes a status code. Table 2 lists some common HTTP status codes used by OpenCPU that the client should be able to interpret. The meaning of these status codes conforms to HTTP standards. The web server may use additional status codes for more general purposes that are not specific to OpenCPU.
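A client might translate these status codes into outcomes roughly as follows. This is a minimal sketch in Python, assuming the client only needs to handle the codes listed in Table 2; the function name interpret_status and the outcome labels are hypothetical and chosen only for illustration.

```python
def interpret_status(status, body="", location=""):
    """Map an HTTP status code to a coarse outcome for the application.

    The outcome labels are illustrative assumptions, not part of the API.
    """
    if status == 200:
        # Successful GET: the body holds the requested representation.
        return {"outcome": "data", "detail": body}
    if status == 201:
        # Successful POST: the response points at the newly created outputs.
        return {"outcome": "created", "detail": location}
    if status == 302:
        return {"outcome": "redirect", "detail": location}
    if status == 400:
        # Computational error: the body carries the error message from R,
        # which should be shown to the user rather than hidden.
        return {"outcome": "r_error", "detail": body}
    if status in (502, 503):
        # The back-end is offline or failing; nothing useful in the body.
        return {"outcome": "server_unavailable", "detail": "see server error logs"}
    # Any other code comes from the web server in general.
    return {"outcome": "other", "detail": body}
```

Treating 400 as a regular, displayable result rather than a fatal failure reflects the role of errors discussed in Section 2.7: the message from R is information for the analyst.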
| Status Code | Happens when | Response content |
| --- | --- | --- |
| 200 OK | On successful GET request | Requested data |
| 201 Created | On successful POST request | Output key and location |
| 302 Found | Redirect | Redirect location |
| 400 Bad Request | On computational error in R | Error message from R in text/plain |
| 502 Bad Gateway | Back-end server offline | - (see error logs) |
| 503 Service Unavailable | Back-end server failure | - (see error logs) |

Table 2: Commonly used HTTP status codes

4.5 Content-types

Clients can retrieve objects in various formats by adding a format identifier suffix to the URL in a GET request. Which formats are supported and how object types map to a particular format is at the discretion of the server implementation. Not every format can support every object type. For example, csv can only be used to retrieve tabular data structures and png is only appropriate for graphics. Table 3 lists the formats OpenCPU supports, the respective internet media type, and the R function that OpenCPU uses to export an object into a particular format. Arguments of the GET requests are mapped to this export function. The png format has parameters such as width and height as documented in ?png, whereas the tab format has parameters sep, eol, dec which specify the delimiting, end-of-line and decimal character respectively as documented in ?write.table.
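The combination of a format suffix and formatting parameters can be sketched as a small URL builder. The Python helper below is an illustration only: the function name resource_url is hypothetical, the session key x0123 is a made-up placeholder, and how a server parses the parameter values is an assumption that is not specified in this section.

```python
from urllib.parse import urlencode


def resource_url(root, path, fmt, **params):
    """Build a GET URL: <root>/<path>/<format>?<formatting parameters>."""
    url = f"{root.rstrip('/')}/{path.strip('/')}/{fmt}"
    if params:
        # Formatting parameters are sent as an ordinary query string.
        url += "?" + urlencode(params)
    return url


# A graphic from a (hypothetical) session, rendered as png with a given size.
png_url = resource_url("/ocpu", "tmp/x0123/graphics/1", "png", width=800, height=600)

# The same dataset in two representations; only the suffix changes.
csv_url = resource_url("/ocpu", "cran/MASS/data/cats", "csv")
json_url = resource_url("/ocpu", "cran/MASS/data/cats", "json")
```

Because the representation is chosen purely through the URL, the same resource path can serve clients with different needs without any change on the server side.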
| Format | Content-type | Export function | Example |
| --- | --- | --- | --- |
| print | text/plain | base::print | /ocpu/cran/MASS/R/rlm/print |
| rda | application/octet-stream | base::save | /ocpu/cran/MASS/data/cats/rda |
| rds | application/octet-stream | base::saveRDS | /ocpu/cran/MASS/data/cats/rds |
| json | application/json | jsonlite::toJSON | /ocpu/cran/MASS/data/cats/json |
| pb | application/x-protobuf | RProtoBuf::serialize_pb | /ocpu/cran/MASS/data/cats/pb |
| tab | text/plain | utils::write.table | /ocpu/cran/MASS/data/cats/tab |
| csv | text/csv | utils::write.csv | /ocpu/cran/MASS/data/cats/csv |
| png | image/png | grDevices::png | /ocpu/tmp/{key}/graphics/1/png |
| pdf | application/pdf | grDevices::pdf | /ocpu/tmp/{key}/graphics/1/pdf |
| svg | image/svg+xml | grDevices::svg | /ocpu/tmp/{key}/graphics/1/svg |

Table 3: Currently supported export formats and corresponding Content-type

4.6 URLs

The root of the API is dynamic, but defaults to /ocpu/ in the current implementation. Clients should make the OpenCPU server address and root path configurable. In the examples we assume the defaults. As discussed before, OpenCPU currently implements two container types to hold resources. Table 4 lists the URLs of the package container type, which includes objects, data, manual pages and files.

| Path | Description | Examples |
| --- | --- | --- |
| . | Package information | /ocpu/cran/MASS/ |
| ./R | Exported namespace objects | /ocpu/cran/MASS/R/, /ocpu/cran/MASS/R/rlm/print |
| ./data | Data objects in the package (HTTP GET only) | /ocpu/cran/MASS/data/, /ocpu/cran/MASS/data/cats/json |
| ./man | Manual pages in the package (HTTP GET only) | /ocpu/cran/MASS/man/, /ocpu/cran/MASS/man/rlm/html |
| ./* | Files in installation directory, relative to the package root | /ocpu/cran/MASS/NEWS, /ocpu/cran/MASS/scripts/ |

Table 4: The package container includes objects, data, manual pages and files.

Table 5 lists URLs of the session container type.
The session container holds outputs generated from an RPC request and includes objects, graphics, source code, stdout and files. As noted earlier, the distinction between packages and sessions is considered an implementation detail. The API does not differentiate between objects and files that appear in packages or in sessions.

| Path | Description | Examples |
|---|---|---|
| . | Session content list | /ocpu/tmp/{key}/ |
| ./R | Objects created by the RPC request | /ocpu/tmp/{key}/R/, /ocpu/tmp/{key}/R/mydata/json |
| ./graphics | Graphics created by the RPC request | /ocpu/tmp/{key}/graphics/, /ocpu/tmp/{key}/graphics/1/png |
| ./source | Source code of the RPC request | /ocpu/tmp/{key}/source |
| ./stdout | STDOUT from the RPC request | /ocpu/tmp/{key}/stdout |
| ./console | Mixed source and STDOUT emulating console output | /ocpu/tmp/{key}/console |
| ./files/* | Files saved to the working directory by the RPC request | /ocpu/tmp/{key}/files/myfile.xyz |

Table 5: The session container includes objects, graphics, source, stdout and files.

4.7 RPC requests

A POST request in OpenCPU always invokes a remote procedure call (RPC). Requests targeting a function object result in a function call where the HTTP parameters from the POST body are mapped to function arguments. A POST targeting a script results in execution of the script, where HTTP parameters are passed to the script interpreter. The term RPC refers to both remote function calls and remote script executions. The current OpenCPU implementation recognizes scripts by their file extension, and supports R, latex, markdown, Sweave and knitr scripts. Table 6 lists each script type with the respective file extension and interpreter.
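As an illustration of a remote function call, the Python sketch below prepares (but does not send) a POST request for the rlm function and shows how a client might extract the temporary key from the location returned with a 201 Created response. The helper names, the host, the argument values and the key x0a1b2c are assumptions made for the example. That the location arrives as a URL of the form /ocpu/tmp/{key}/ is also an assumption, since the text only states that the response contains the output key and location.

```python
from urllib.parse import urlencode, urlsplit
from urllib.request import Request


def build_rpc_request(function_url, params):
    """Prepare a POST request whose body parameters become function arguments."""
    body = urlencode(params).encode("ascii")
    return Request(
        function_url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def key_from_location(location, root="/ocpu/"):
    """Extract the temporary key from a session location such as /ocpu/tmp/{key}/."""
    path = urlsplit(location).path
    prefix = root.rstrip("/") + "/tmp/"
    if not path.startswith(prefix):
        raise ValueError("not a session location: " + location)
    return path[len(prefix):].split("/", 1)[0]


def describe_status(status):
    """Map the status codes of Table 2 to a coarse client-side outcome."""
    if status in (200, 201):
        return "success"
    if status == 302:
        return "redirect"
    if status == 400:
        return "error in R"       # body holds the R error message as text/plain
    if status in (502, 503):
        return "server problem"   # details are in the server error logs
    return "unexpected"


# Illustrative call of the rlm function from MASS; the values are placeholders.
request = build_rpc_request(
    "http://localhost/ocpu/cran/MASS/R/rlm",
    {"formula": "Hwt ~ Bwt", "data": "cats"},
)

# Location value as it might appear in a 201 Created response (made-up key).
key = key_from_location("http://localhost/ocpu/tmp/x0a1b2c/")
```

The request object could be passed to urllib.request.urlopen against a running server. The sketch stops short of that so it stays runnable offline.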
| File extension | Type | Interpreter |
|---|---|---|
| file.r | R | evaluate::evaluate |
| file.tex | LaTeX | tools::texi2pdf |
| file.rnw | knitr / sweave | knitr::knit + tools::texi2pdf |
| file.md | markdown | knitr::pandoc |
| file.rmd | knitr markdown | knitr::knit + knitr::pandoc |
| file.brew | brew | brew::brew |

Table 6: Files recognized as scripts and their characterizing file extension

An important conceptual difference with a terminal interface is that in the OpenCPU API, the server determines the namespace that the output of a function call is assigned to. The server includes a temporary key in the RPC response that serves the same role as a variable name. The key is used to reference the newly created resources in future requests. Besides the return value, the server also stores graphics, files, warnings, messages and stdout that were created by the RPC. These can be listed and retrieved using the same key. In R, the function call itself is also an object, which is added to the collection for reproducibility purposes. Objects on the system are immutable and therefore the client cannot change or overwrite existing keys. For functions that modify the state of an object, the server creates a copy of the modified resource with a new key and leaves the original unaffected.

4.8 Arguments

Arguments to a remote function call can be posted using one of several methods. A data interchange format such as JSON or Protocol Buffers can be used to directly post data structures such as lists, vectors, matrices or data frames. Alternatively, the client can reference the name or key of an existing object. The server automatically resolves keys and converts interchange formats into objects to be used as arguments in the function call. Files contained in a multipart/form-data payload of an RPC request are copied to the working directory and the argument of the function call is set to the filename.
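The argument methods can be illustrated with a small Python sketch that encodes arguments either as a JSON body or as form fields carrying inline JSON. The names KeyRef, encode_json_args and encode_form_args, the argument names and the key x0a1b2c are hypothetical. In particular, sending a key as an unquoted form value is an assumption about how a server could tell a reference apart from a string. The restriction of key references to form payloads follows Table 7.

```python
import json
from urllib.parse import urlencode


class KeyRef:
    """Marks an argument as a reference to an existing object or temporary key."""

    def __init__(self, key):
        self.key = key


def encode_json_args(args):
    """Encode all arguments as one JSON object (one top-level field each)."""
    if any(isinstance(v, KeyRef) for v in args.values()):
        # Table 7 lists temporary keys only for the two form content types.
        raise TypeError("key references are not supported in a JSON body")
    return "application/json", json.dumps(args).encode("utf-8")


def encode_form_args(args):
    """Encode each argument as one form field: inline JSON, or a bare key."""
    fields = {
        name: value.key if isinstance(value, KeyRef) else json.dumps(value)
        for name, value in args.items()
    }
    body = urlencode(fields).encode("ascii")
    return "application/x-www-form-urlencoded", body


# A data structure posted directly as JSON.
ctype_json, body_json = encode_json_args({"x": [1, 2, 3], "na.rm": True})

# A similar call that reuses the output of an earlier RPC by its key.
ctype_form, body_form = encode_form_args({"x": KeyRef("x0a1b2c"), "na.rm": True})
```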
Because uploaded files are passed by filename, remote function calls with file arguments can be performed using standard HTML form submission.

| Content-type | Primitives | Data structures | Raw code | Files | Temp key |
|---|---|---|---|---|---|
| multipart/form-data | OK | OK (inline json) | OK | OK | OK |
| application/x-www-form-urlencoded | OK | OK (inline json) | OK | - | OK |
| application/json | OK | OK | - | - | - |
| application/x-protobuf | OK | OK | - | - | - |

Table 7: Accepted request Content-types and supported argument formats

The current implementation supports several standard Content-type formats for passing arguments to a remote function call within a POST request, including application/x-www-form-urlencoded, multipart/form-data, application/json and application/x-protobuf. Each parameter or top-level field within a POST payload contains a single argument value. Table 7 shows a matrix of supported argument formats for each Content-type.

4.9 Privacy

Because the data and sources of a statistical analysis include potentially sensitive information, the temporary keys from RPC requests are private. Clients should default to keeping these keys secret, given that leaking a key will compromise the confidentiality of their data. The system does not allow clients to search for keys or retrieve resources without providing the appropriate key. In this sense, a temporary key has a similar status to an access token. Because temporary keys are private, multiple users can share a single OpenCPU server without any form of authentication. Each request is anonymous and confidential, and only the client that performed the RPC has the key to access resources from a particular request. However, temporary keys do not have to be kept private per se: clients can choose to exchange keys with other clients. Unlike typical access tokens, the keys in OpenCPU are unique for each request.
Hence by publishing a particular key, the client reveals only the resources from a specific RPC request, and no other confidential information. Resources in OpenCPU are not tied to any particular user; in fact, there are no users in the OpenCPU system itself. Clients can share objects, graphics or files with each other simply by communicating the keys to these resources. Because each key holds both the output and the sources of an RPC request, shared objects are reusable and reproducible by design. In some sense, all clients share a single universal namespace with keys containing hidden objects from all RPC requests. A client that knows the key to a particular resource can use it like any other object on the system. This shapes the contours of a social analysis platform in which users collaborate by sharing reproducible, reusable resources identified by unique keys.
