Protein Models Comparator: Scalable Bioinformatics Computing on the Google App Engine Platform
Authors: Paweł Widera, Natalio Krasnogor
School of Computer Science, University of Nottingham, UK

Background: The comparison of computer generated protein structural models is an important element of protein structure prediction. It has many uses including model quality evaluation, selection of the final models from a large set of candidates or optimisation of parameters of energy functions used in template-free modelling and refinement. Although many protein comparison methods are available online on numerous web servers, they are not well suited for large scale model comparison: (1) they operate with methods designed to compare actual proteins, not the models of the same protein, (2) the majority of them offer only a single pairwise structural comparison and are unable to scale up to the required order of thousands of comparisons. To bridge the gap between protein and model structure comparison we have developed the Protein Models Comparator (pm-cmp). To be able to deliver scalability on demand and handle large comparison experiments, pm-cmp was implemented "in the cloud".

Results: Protein Models Comparator is a scalable web application for fast distributed comparison of protein models with the RMSD, GDT_TS, TM-score and Q-score measures. It runs on the Google App Engine cloud platform and is a showcase of how the emerging PaaS (Platform as a Service) technology could be used to simplify the development of scalable bioinformatics services. The functionality of pm-cmp is accessible through an API which allows full automation of experiment submission and results retrieval.
Protein Models Comparator is free software released under the Affero GNU Public Licence and is available with its source code at: http://www.infobiotics.org/pm-cmp

Conclusions: This article presents a new web application addressing the need for large-scale model-specific protein structure comparison and provides an insight into the GAE (Google App Engine) platform and its usefulness in scientific computing.

Background

Protein structure comparison seems to be most successfully applied to the functional classification of newly discovered proteins. As the evolutionary continuity between the structure and the function of proteins is strong, it is possible to infer the function of a new protein based on its structural similarity to known protein structures. This is, however, not the only application of structural comparison. There are several aspects of protein structure prediction (PSP) where robust structural comparison is very important.

The most common application is the evaluation of models. To measure the quality of a model, the predicted structure is compared against the target native structure. This type of evaluation is performed on a large scale during the CASP experiment (Critical Assessment of protein Structure Prediction), when all models submitted by different prediction groups are ranked by their similarity to the target structure. Depending on the target category, which could be either a template-based modelling (TBM) target or a free modelling (FM) target, the comparison emphasis is put either on local similarity and identification of well predicted regions or on the global distance between the model and the native structure [1-3].

The CASP evaluation is done only for the final models submitted by each group. These models have to be selected from a large set of computer generated candidate structures of unknown quality.
The most promising models are commonly chosen with the use of clustering techniques. First, all models are compared against each other and then split into several groups of close similarity (clusters). The most representative elements of each cluster (e.g. cluster centroids) are selected as final models for submission [4, 5].

The generation of models in the free modelling category, as well as the process of model refinement in both FM and TBM categories, requires a well designed protein energy function. As it is believed that the native structure is in a state of thermodynamic equilibrium and low free energy, the energy function is used to guide the structural search towards more native-like structures. Ideally, the energy function should have low values for models within a small structural distance to the native structure, and high values for the most distinct and non-protein-like models. To ensure such properties, the parameters of energy functions are carefully optimised on a training set of models for which the real distances to the native structures are precomputed [6-9].

Model comparison vs. protein alignment

All three of these aspects of prediction: evaluation of model quality, selection of the best models from a set of candidates and the optimisation of energy functions, require a significant number of structural comparisons to be made. However, these comparisons are not made between two proteins, but between two protein models that are structural variants of the same protein and are composed of the same set of atoms. Because of that, the alignment between the atoms is known a priori and is fixed, in contrast to comparison between two different proteins where the alignment of atoms usually has to be found before scoring the structural similarity.
Even though searching for an optimal alignment is not necessary in model comparison, assessing model similarity is still not straightforward. Additional complexity is caused in practice by the incompleteness of models. For example, many CASP submitted models contain the atomic coordinates for just a subset of the protein sequence. Often even the native structures have several residues missing, as the X-ray crystallography experiments do not always locate all of them. As the model comparison measures operate only on structures of equal length, a common set of residues has to be determined for each pair of models before the comparison is performed (see Figure 1). It should be noted that this is not an alignment in the traditional sense but just a matching procedure that selects the residues present in both structures.

Figure 1. Matching common residues between two structures. There are two common cases when the number of residues differs between the structures: (A) some residues at the beginning/end of a protein sequence were not located in the crystallography experiment and (B) the structure was derived from templates that did not cover the entire protein sequence. In both cases pm-cmp performs a comparison using the maximum common subset of residues.

Comparison servers

Although many protein structure comparison web services are already available online, they are not well suited for model comparison. Firstly, they do not operate on the scale needed for such a task. Commonly these methods offer a simple comparison between two structures (1:1) or, in the best case, a comparison between a single structure and a set of known structures extracted from the Protein Data Bank (1:PDB), while what is really needed is the ability to compare a large number of structures either against a known native structure (1:N) or against each other (N:N).
Secondly, the comparison itself is done using just a single comparison method, which may not be reliable enough for all cases (types of proteins, sizes etc.). An exception to this is the ProCKSI server [10] that uses several different comparison methods and provides 1:N and N:N comparison modes. However, it operates with methods designed to compare real proteins, not the models generated in the process of PSP, and therefore it lacks the ability to use a fixed alignment while scoring the structural similarity. Also, the high computational cost of these methods makes large-scale comparison experiments difficult without the support of grid computing facilities (see our previous work on this topic [11, 12]).

The only server able to perform a large-scale model-specific structural comparison we are aware of is the infrastructure implemented to support the CASP experiment [13]. This service, however, is only available to a small group of CASP assessors for the purpose of evaluation of the predictions submitted for the current edition of CASP. It is a closed and proprietary system that is not publicly available, either as an online server or in the form of source code. Due to that, it cannot be freely used, replicated or adapted to the specific needs of the users. We have created the Protein Models Comparator (pm-cmp) to address these issues.

Google App Engine

We implemented pm-cmp using the Google App Engine (GAE) [14], a recently introduced web application platform designed for scalability. GAE operates as a cloud computing environment providing Platform as a Service (PaaS), and removes the need to consider physical resources as they are automatically scaled up as and when required. Any individual or small team with enough programming skills can build a distributed and scalable application on GAE without the need to spend any resources on the setup and maintenance of the hardware infrastructure.
This way scientists, freed from tedious configuration and administration tasks, can focus on what they do best, the science itself.

GAE offers two runtime environments based on Python or Java. Both environments offer an almost identical set of platform services; they differ only in maturity, as the Java environment was introduced 12 months after the first preview of the Python one. The environments are well documented and frequently updated with new features. A limited amount of GAE resources is provided for free and is enough to run a small application. These limits are consequently decreased with each release of the platform SDK (Software Development Kit) as the stability and performance issues are ironed out. There are no set-up costs and all payments are based on the daily amount of resources (storage, bandwidth, CPU time) used above the free levels.

In the next sections we describe the overall architecture and functionality of our web application, exemplify several use cases, present the results of the performance tests, discuss the main limitations of our work and point out a few directions for the future.

Implementation

The pm-cmp application enables users to set up a comparison experiment with a chosen set of similarity measures, upload the protein structures and download the results when all comparisons are completed. The interaction between pm-cmp and the user is limited to the four steps presented in Figure 2.

Figure 2. Application control flow. The interaction with a user is divided into 4 steps: setup of the experiment options, upload of the structural models, start of the computations and finally download of the results when ready.

Application architecture

The user interface (UI) and most of the application logic was implemented in Python using the web2py framework [15].
Because web2py provides an abstraction layer for data access, this code is portable and could run outside of the GAE infrastructure with minimal changes. Thanks to the syntactic brevity of the Python language and the simplicity of web2py constructs, the pm-cmp application is also very easy to extend. For visualisation of the results the UI module uses Flot [16], a JavaScript plotting library.

The comparison engine was implemented in Groovy using Gaelyk [17], a small lightweight web framework designed for GAE. It runs in the Java Virtual Machine (JVM) environment and interfaces with the BioShell Java library [18] that implements a number of structure comparison methods. We decided to use Groovy for the ease of development and Python-like programming experience, especially as a dedicated GAE framework (Gaelyk) already existed. We did not use any of the enterprise-level Java frameworks such as Spring, Stripes, Tapestry or Wicket as they are more complex (often requiring a sophisticated XML-based configuration) and were not fully compatible with GAE, due to the specific restrictions of its JVM. However, recently a number of workarounds have been introduced to make some of these frameworks usable on GAE.

The communication between the UI module and the comparison engine is done with the use of an HTTP request. The request is sent when all the structures have been uploaded and the experiment is ready to start (see Figure 3). The comparison module organises all the computational work required for the experiment into small tasks. Each task, represented as an HTTP request, is put into a queue and later automatically dispatched by GAE according to the defined scheduling criteria.

Figure 3. Protein Models Comparator architecture. The application GUI was implemented in the GAE Python environment. It guides the user through the setup of an experiment and then sends an HTTP request to the comparison engine to start the computations.
The comparison engine was implemented in the GAE Java environment.

Distribution of tasks

Task execution on GAE is scheduled with a token bucket algorithm that has two parameters: a bucket size and a bucket refill rate. The number of tokens in the bucket limits the number of tasks that can be executed at a given time (see Figure 4). Tasks executed in parallel run on separate instances of the application in the cloud. New instances are automatically created when needed and deleted after a period of inactivity, which enables the application to scale dynamically to the current demand. Our application uses tasks primarily to distribute the computations, but also for other background activities like deletion of uploaded structures or old experiment data.

Figure 4. Task queue management on Google App Engine. A) 8 tasks have been added to a queue. The token bucket is full and has 3 tokens. B) Tokens are used to run 3 tasks and the bucket is refilled with 2 new tokens.

The computations are distributed as separate structure vs. structure comparison tasks. Each task reads the structures previously written to the datastore by the UI module, performs the comparison and stores back the results. This procedure is slightly optimised with the use of the GAE memcache service: each time a structure is read for the second time it is served from a fast local cache instead of being fetched from the slower distributed datastore. Also, to minimise the number of datastore reads, all selected measures are computed together in a single task.

The comparison of two structures starts with a search for the common Cα atoms. Because the comparison methods require both structures to be equal in length, a common atomic denominator is used in the comparison.
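The token bucket scheduling described above can be sketched as follows. This is our own toy model: the class and method names are ours, not GAE internals, and the numbers simply mirror the scenario of Figure 4.

```python
from dataclasses import dataclass

@dataclass
class TokenBucket:
    """Toy model of token-bucket task throttling (names are ours).

    `size` caps the number of tokens held at once; `refill` tokens are
    added per scheduling tick, never exceeding the cap."""
    size: int       # bucket size
    refill: int     # bucket refill rate (tokens per tick)
    tokens: int = 0

    def tick(self) -> None:
        # Refill the bucket, never exceeding its capacity.
        self.tokens = min(self.size, self.tokens + self.refill)

    def dispatch(self, pending: int) -> int:
        # Start as many pending tasks as there are tokens; each started
        # task consumes one token. Returns the number of tasks started.
        started = min(self.tokens, pending)
        self.tokens -= started
        return started

# The scenario of Figure 4: a full bucket of 3 tokens, 8 queued tasks.
bucket = TokenBucket(size=3, refill=2, tokens=3)
started = bucket.dispatch(8)   # 3 tasks start, the bucket is emptied
bucket.tick()                  # 2 new tokens arrive
```

With a rate of 4 tokens per second and a bucket size of 10 (the settings used in our benchmarks), short bursts of up to 10 tasks can start at once, while the sustained throughput stays at 4 tasks per second.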
If required, the total length of the models is used as a reference for the similarity scores, so that the score of a partial match is proportionally lower than the score of a full-length match. This approach makes the comparison very robust, even for models of different size (as long as they share a number of atoms).

Results

The pm-cmp application provides a clean interface to define a comparison experiment and upload the protein structures. In each experiment the user can choose which measures and which comparison mode (1:N or N:N) should be used (see Figure 5). Currently, four structure comparison measures are implemented: RMSD, GDT_TS [19], TM-score [20] and Q-score [21]. These are the main measures used in the evaluation of CASP models.

Additionally, a user can choose the scale of reference for GDT_TS and TM-score. It can be either the number of matching residues or the total size of the structures being compared; this changes the results only if the models are incomplete. The first option is useful when a user is interested in the similarity score regardless of the number of residues used in the comparison. For example, she submits incomplete models containing only the coordinates of residues predicted with high confidence and wants to know how good these fragments are alone. On the other hand, a user might want to take into account all residues in the structures being compared, not just the matching ones. For that, she would use the second option, where the similarity score is scaled by the length of the target structure (in 1:N comparison mode) or by the length of the shorter structure from the pair being compared (in N:N comparison mode). This way a short fragment with a perfect match will have a lower score than a less perfect full-length match.

After setting up the experiment, the next step is the upload of models. This is done with the use of Flash to allow multiple file uploads.
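The "total length" reference scale described above amounts to rescaling a score computed over the matched residues. The helper below is our own illustration of the idea, not pm-cmp's code; the exact normalisation inside each measure may differ.

```python
def rescale(score: float, n_matched: int, reference_length: int) -> float:
    """Express a similarity score computed over `n_matched` residues
    relative to `reference_length` residues, e.g. the target length in
    1:N mode or the shorter structure of a pair in N:N mode."""
    return score * n_matched / reference_length

# A perfect match covering half of a 100-residue target then scores
# lower than a less perfect but full-length match:
short_perfect = rescale(1.0, 50, 100)    # 0.5
long_imperfect = rescale(0.8, 100, 100)  # 0.8
```

Under the first option (match-length reference) no rescaling happens, so the short perfect fragment would keep its score of 1.0.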
The user can track the progress of the upload of each file and the number of files left in the upload queue. When the upload is finished the user can start the computations or, if needed, upload more models.

The current status of recently submitted experiments is shown on a separate page. Instead of checking the status there, a user can provide an e-mail address at experiment setup to be notified when the experiment is finished. The results of the experiment are presented in the form of interactive histograms showing, for each measure, the distribution of scores across the models (see Figure 6). A raw data file is also provided for download and possible further analysis (e.g. clustering). In case of errors the user is notified by e-mail and a detailed list of problems is given. In most cases errors are caused by inconsistencies in the set of models, e.g. lack of common residues, use of different chains, mixing models of different proteins or non-conformance to the PDB format. Despite the errors, the partial results are still available and contain all successfully completed comparisons.

There are three main advantages of pm-cmp over the existing online services for protein structure comparison. First of all, it can work with multiple structures and run experiments that may require thousands of pairwise comparisons. Secondly, these comparisons are performed correctly, even if some residues are missing in the structures, thanks to the residue matching mechanism. Thirdly, it integrates several comparison measures in a single service, giving the users an option to choose the aspect of similarity they want to test their models with.

Figure 5. Experiment setup screen. To set up an experiment the user has to choose a label for it, optionally provide an e-mail address (if she wants to be notified about the experiment status), select one or more comparison measures, and choose the comparison mode (1:N or N:N) and the reference scale.

Figure 6. Example of distribution plots. For a quick visual assessment of model diversity, the results of the comparison are additionally presented as histograms of the similarity/distance values.

Application Programming Interface (API)

As Protein Models Comparator is built in the REST (REpresentational State Transfer) architecture, it is easy to access programmatically. It uses standard HTTP methods (e.g. GET, POST, DELETE) to provide services and communicates back the HTTP response codes (e.g. 200 - OK, 404 - Not Found) along with the content. Using the RESTful API summarised in Table 1, it is possible to set up an experiment, upload the models, start the computations, check the experiment status and download the results file automatically. We provide pm-cmp-bot.py, an example Python script that uses this API to automate experiment submission and results retrieval.

Table 1. Description of the RESTful interface of pm-cmp.

  URL                           Method  Parameters                                    Return
  /experiments/setup            POST    label - string                                303 Redirect
                                        measures - subset of [RMSD, GDT_TS,
                                          TM-score, Q-score]
                                        mode - first against all or all against all
                                        scale - match length or total length
  /experiments/structures/[id]  POST    file - multipart/form-data encoded file       HTML link to the uploaded file
  /experiments/start/[id]       GET     -                                             200 OK
  /experiments/status/[id]      GET     -                                             status in plain text
  /experiments/download/[id]    GET     -                                             results file
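A client for the Table 1 interface can be sketched in a few lines of standard-library Python. This is our own minimal illustration, not pm-cmp-bot.py itself; the `endpoint` helper and function names are ours, and the multipart file upload to /experiments/structures/[id] is omitted for brevity.

```python
from urllib import parse, request

BASE = "http://www.infobiotics.org/pm-cmp"  # service root from the paper

def endpoint(action: str, exp_id=None) -> str:
    # Build an /experiments/... URL following the Table 1 layout.
    suffix = f"/{exp_id}" if exp_id is not None else ""
    return f"{BASE}/experiments/{action}{suffix}"

def setup_experiment(label, measures, mode, scale):
    """POST the experiment options; the field names follow Table 1.
    A successful setup answers with a 303 redirect."""
    data = parse.urlencode({"label": label, "measures": measures,
                            "mode": mode, "scale": scale}).encode()
    return request.urlopen(endpoint("setup"), data=data)

def start_experiment(exp_id):
    return request.urlopen(endpoint("start", exp_id))          # 200 OK

def experiment_status(exp_id) -> bytes:
    return request.urlopen(endpoint("status", exp_id)).read()  # plain text

def download_results(exp_id) -> bytes:
    return request.urlopen(endpoint("download", exp_id)).read()
```

Structure upload requires a multipart/form-data POST, which is the part pm-cmp-bot.py implements with retry handling.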
As we wanted to keep the script simple and readable, the handling of connection problems is limited to the most I/O-intensive upload part and in general the script does not retry on error, verify the response, etc. Despite that, it is a fully functional tool and it was used in several tests described in the next section.

Performance tests

To examine the performance of the proposed architecture we ran a 48h test in which a group of beta testers ran multiple experiments in parallel at different times of day. As a benchmark we used the models generated by I-TASSER [22], one of the top prediction methods in the last three editions of CASP. From each set containing every 10th structure from the I-TASSER simulation timeline we selected the top n models, i.e. the closest to the native by means of RMSD. The number of models was chosen in relation to the protein length to obtain one small, two medium and one large size experiment, as shown in Table 2. The smallest experiment was four times smaller than the large one and two times smaller than the medium ones.

Table 2. Four sets of protein models used in the performance benchmark (available for download on the pm-cmp website).

  protein            1b72A     1kviA     1egxA      1fo5A
  (models x length)  (350x49)  (500x68)  (300x115)  (800x85)
  total size         17150     34000     34500      68000

We observed a very consistent behaviour of the application, with a relative absolute median deviation of the total experiment processing time smaller than 10%. The values reported in Table 3 show the statistics for 15 runs per each of the four sets of models. The task queue rate was set to 4/s with a bucket size of 10. Whenever the execution of two experiments overlapped, we accounted for this overlap by subtracting the waiting time from the execution time, so that the time spent in a queue while the other experiment was still running was not counted.
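The spread statistic used throughout these benchmarks is the median absolute deviation (defined in the Table 3 footnote); a direct transcription, with the relative variant quoted above:

```python
from statistics import median

def mad(xs):
    """Median absolute deviation: mad = median_i(|x_i - median(X)|)."""
    m = median(xs)
    return median(abs(x - m) for x in xs)

def relative_mad(xs):
    # The "relative absolute median deviation": mad as a fraction of the median.
    return mad(xs) / median(xs)
```

For example, the 1b72A row of Table 3 (median 178s, mad 17s) corresponds to a relative deviation of 17/178, just under the 10% bound quoted above.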
Using GAE 1.2.7 we were able to run about 30 experiments per day while staying within the free CPU quota.

Table 3. Results of the performance benchmark.

                           processing time [s]
  protein  models  length  median  mad*  min  max
  1b72A    350     49      178     17    108  272
  1egxA    300     115     195     17    125  274
  1kviA    500     68      236     16    203  406
  1fo5A    800     85      369     33    307  459

  *) mad (median absolute deviation) = median_i(|x_i - median(X)|)

To test the scalability of pm-cmp we ran two additional large experiments with approximately 2500 comparisons each (using GAE 1.3.8). We again used the models generated by I-TASSER: 2500 models for [PDB:1b4bA] (every 5th structure from the simulation timeline) and 70 models for [PDB:2rebA2] (top models from every 10th structure sample set). The results of 11 runs per set are summarised in Table 4. All runs were separated by a 15-minute inactivity time, to allow GAE to bring down all active instances. Thus each run activated the application instances from scratch, instead of reusing instances activated by the previous run. Because the experiments did not overlap, and due to the use of a more mature version of the GAE platform, the relative absolute median deviation was much lower than in the first performance benchmark and did not exceed 3.5%.

Table 4. Performance for a large number of comparisons.

                                          processing time [s]
  experiment              models  length  median  mad    min  max
  1b4bA  (1:N, 2501 cmp)  2500    71      838.00  25.00  746  903
  2rebA2 (N:N, 2415 cmp)  70      60      854.00  29.00  731  958

To relate the performance of our application to the performance of the comparison engine executed locally, we conducted another test. This time we followed a typical CASP scenario and evaluated the 308 server-submitted models for the CASP9 target T0618 ([PDB:3nrhA]).
The comparison against the target structure was performed with the use of pm-cmp-bot and two times were measured: the experiment execution time (as in the previous test) and the total time used by pm-cmp-bot (including upload/download times). The statistics of 11 runs are reported in Table 5. As the experiments were performed in 1:N mode, the file upload process took a substantial 30% of the total time. The local execution of the comparison engine on a machine with an Intel P8400 2.26GHz (2 core CPU) was almost 5 times slower than the execution in the cloud. We consider this to be a significant speed-up, especially having in mind the conservative setting of the task queue rate (4/s, while GAE allows a maximum of 100/s). Our preliminary experiments with GAE 1.4.3 showed that the speedup possible with a queue rate of 100 tasks per second is at least an order of magnitude larger.

Table 5. Performance compared to local execution.

                       processing time [s]
  platform  time       median  mad  min  max
  GAE       total      135     4    127  146
  GAE       execution  89      2    86   97
  local     execution  413     8    394  422

Discussion

The pm-cmp application is a convenient tool for comparing a set of protein models against a target structure (e.g. in model quality assessment or optimisation of energy functions for PSP) or against each other (e.g. in selection of the most frequently occurring structures). It is also an interesting showcase of scalable scientific computing on the Google App Engine platform. To provide more insight into the usefulness of GAE in bioinformatics applications in general, we discuss below the main limitations of our approach, possible workarounds and future work.

Response time limit

A critical issue in implementing an application working on GAE was to keep the response time to each HTTP request below the 30s limit. This is why the division of work into small tasks and the extensive use of queues was required.
However, this might no longer be critical in the recent releases of GAE 1.4.x, which allow background tasks to run 20 times longer. In our application, where a single pairwise comparison with all methods never took longer than 10s, the task execution time was never an issue. The bottleneck was the task distribution routine. As it was not possible to read more than 1000 entities from the datastore within the 30s time limit, our application was initially not able to scale up above 1000 comparisons per experiment. However, GAE 1.3.1 introduced the mechanism of cursors to tackle this very problem. That is, when a datastore query is performed, its progress can be stored in a cursor and resumed later. Our code distribution routine simply calls itself (by adding itself to the task queue) just before the time limit is reached and continues the processing in the next cycle. This way our application can scale up to thousands of models. However, as it currently operates within the free CPU quota limit, we do not allow very large experiments online yet. For practical reasons we set the limit to 5000 comparisons. This allows us to divide the daily CPU limit between several smaller experiments, instead of having it all consumed by a single large experiment. In the future we would like to monitor the CPU usage and adjust the size of the experiment with respect to the amount of resources left each day.

Native code execution

Both environments available on GAE are built on interpreted languages. This is not an issue in the case of standard web applications; however, in scientific computing the efficiency of code execution is very important (especially in the context of the response time limits mentioned above). A common practice of binding these languages with fast native modules written in C/C++ is unfortunately not an option on GAE. No arbitrary native code can be run on the GAE platform, only pure Python/Java modules.
Although Google has been extending the number of native modules available on GAE, it is rather unlikely that we will see modules for fast numeric computation such as NumPy anytime soon. For that reason we implemented the comparison engine on the Java Virtual Machine, instead of using Python.

Bridging Python and Java

Initially we wanted to run our application as a single module written in Jython (an implementation of Python in Java) running inside a Java servlet, and then bridge it with the web2py framework to combine Python's ease of use with the numerical speed of the JVM. However, we found that this is not possible without mapping all GAE Python API calls made by the web2py framework to their Java API correspondents. As the amount of work needed to do that exceeded the time we had for the project, we attempted to join these two worlds differently. We decided to implement it as two separate applications, each in its own environment, but sharing the same datastore. This was not possible directly, as each GAE application can access only its own datastore. We had to resort to the mechanism of versions, which was designed to test a new version of an application while the old one is still running. Each version is independent from the others in terms of the environment used, and they all share the same datastore. This might be considered a hack and not a very elegant solution, but it worked exactly as intended; we ended up with two separate modules accessing the same data.

Handling large files

There is a hard 1MB limit on the size of a datastore entity. The dedicated Blobstore service introduced in GAE 1.3.0 makes the upload of large files possible, but as it was considered experimental and at first did not provide an API to read the blob content, we decided not to use it.
As a consequence we could not use the simple approach of uploading all experiment data in a single compressed file. Instead, we decided to upload the files one by one directly to the datastore, since a single protein structure file is usually much smaller than 1MB. To make the upload process easy and capable of handling hundreds of files, we used the Uploadify library [23], which combines JavaScript and Flash to provide multi-file upload and progress tracking. Although since GAE 1.3.5 it is now possible to read the content of a blob, multiple file decompression still remains a complex issue because GAE lacks direct access to the file system. It would be interesting to investigate in the future whether a task cycling technique (as used in our distribution routine) could be used to tackle this problem.

Vendor lock-in

Although the GAE code remains proprietary, the software development kit (SDK) required to run the GAE stack locally is free software licensed under the Apache Licence 2.0. Information contained in the SDK code allowed the creation of two alternative free software implementations of the GAE platform: AppScale [24] and TyphoonAE [25]. The risk of vendor lock-in is therefore minimised, as the same application code could be run outside of Google's platform if needed.

Comparison to other cloud platforms

GAE provides an infrastructure that automates much of the difficulty related to creating scalable web applications and is best suited for small and medium-sized applications. Applications that need high performance computing, access to a relational database or direct access to operating system primitives might be better suited to more generic cloud computing frameworks. There are two major competitors to the Google platform.
Microsoft's Azure Services are based on the .NET framework and provide a less constrained environment, but require writing more low-level code and do not guarantee scalability. Amazon Web Services are a collection of low-level tools that provide Infrastructure as a Service (IaaS), that is storage and hardware. Users can assign a web application to as many computing units (instances of virtual machines) as needed. They also receive complete control over the machines, at the cost of required maintenance and administration. Similarly to Microsoft's cloud, it does not provide automated scalability, so it is clearly a trade-off between access at a lower and unconstrained level and scalability that has to be implemented by the user. Additionally, both these platforms are fully paid services and do not offer free/start-up resources.

Conclusions

Protein Models Comparator fills the gap between the simple 1:1 protein comparison commonly offered online and the non-public proprietary CASP large-scale evaluation infrastructure. It has been implemented on the Google App Engine platform, which offers automatic scalability at the data storage and task execution levels. In addition to a friendly web user interface, our service is accessible through a REST-like API that allows full automation of the experiments (we provide an example script for remote access). Protein Models Comparator is free software, which means anyone can study and learn from its source code, extend it with their own modifications, or even set up clones of the application either on GAE or on one of the alternative platforms such as AppScale or TyphoonAE. Although GAE is a great platform for prototyping, as it eliminates the need to set up and maintain hardware, provides resources on demand and scales automatically, the task execution limit makes it suitable only for highly parallel computations (i.e.
the ones that could be split into small independent chunks of work). Also, the lack of direct disk access and the inability to execute native code restrict the possible uses of GAE. However, looking back at the history of changes, it seems likely that the GAE platform will become less and less restricted in the future. For example, long running background tasks had been at the top of the GAE project roadmap [26], and recently the task execution limit was raised in GAE 1.4.x, making the platform more suitable for scientific computations.

Acknowledgements

We would like to thank the fellow researchers who kindly devoted their time to testing pm-cmp: E. Glaab, J. Blakes, J. Smaldon, J. Chaplin, M. Franco, J. Bacardit, A.A. Shah, J. Twycross and C. García-Martínez. This work was supported by the Engineering and Physical Sciences Research Council [EP/D061571/1]; and the Biotechnology and Biological Sciences Research Council [BB/F01855X/1].

References

[1] Y. Zhang, "Progress and challenges in protein structure prediction," Current Opinion in Structural Biology, vol. 18, pp. 342–348, Jun. 2008. doi:10.1016/j.sbi.2008.02.004.
[2] D. Cozzetto, A. Giorgetti, D. Raimondo, and A. Tramontano, "The Evaluation of Protein Structure Prediction Results," Molecular Biotechnology, vol. 39, no. 1, pp. 1–8, 2008. doi:10.1007/s12033-007-9023-6.
[3] A. Kryshtafovych and K. Fidelis, "Protein structure prediction and model quality assessment," Drug Discovery Today, vol. 14, pp. 386–393, Apr. 2009. doi:10.1016/j.drudis.2008.11.010.
[4] D. Shortle, K. T. Simons, and D. Baker, "Clustering of low-energy conformations near the native structures of small proteins," Proceedings of the National Academy of Sciences of the United States of America, vol. 95, pp. 11158–11162, Sept. 1998 [cited 2010-09-21].
[5] Y. Zhang and J. Skolnick, "SPICKER: A clustering approach to identify near-native protein folds," J. Comput. Chem., vol. 25, no. 6, pp. 865–871, 2004. doi:10.1002/jcc.20011.
[6] Y. Zhang, A. Kolinski, and J. Skolnick, "TOUCHSTONE II: A New Approach to Ab Initio Protein Structure Prediction," Biophys. J., vol. 85, pp. 1145–1164, Aug. 2003. doi:10.1016/S0006-3495(03)74551-2.
[7] C. A. Rohl, C. E. M. Strauss, K. M. S. Misura, and D. Baker, "Protein Structure Prediction Using Rosetta," in Numerical Computer Methods, Part D (L. Brand and M. L. Johnson, eds.), vol. 383 of Methods in Enzymology, pp. 66–93, Academic Press, Jan. 2004. doi:10.1016/S0076-6879(04)83004-0.
[8] P. Widera, J. Garibaldi, and N. Krasnogor, "GP challenge: evolving energy function for protein structure prediction," Genetic Programming and Evolvable Machines, vol. 11, pp. 61–88, March 2010. doi:10.1007/s10710-009-9087-0.
[9] J. Zhang and Y. Zhang, "A Novel Side-Chain Orientation Dependent Potential Derived from Random-Walk Reference State for Protein Fold Selection and Structure Prediction," PLoS ONE, vol. 5, p. e15386, Oct. 2010. doi:10.1371/journal.pone.0015386.
[10] D. Barthel, J. D. Hirst, J. Blazewicz, and N. Krasnogor, "ProCKSI: A Decision Support System for Protein (Structure) Comparison, Knowledge, Similarity and Information," BMC Bioinformatics, vol. 8, no. 1, p. 416, 2007. doi:10.1186/1471-2105-8-416.
[11] G. Folino, A. Shah, and N. Krasnogor, "On the storage, management and analysis of (multi) similarity for large scale protein structure datasets in the grid," in 22nd IEEE International Symposium on Computer-Based Medical Systems (CBMS 2009), pp. 1–8, Aug. 2009. doi:10.1109/CBMS.2009.5255328.
[12] A. Shah, G. Folino, and N. Krasnogor, "Toward High-Throughput, Multicriteria Protein-Structure Comparison and Analysis," IEEE Transactions on NanoBioscience, vol. 9, pp. 144–155, Jun. 2010. doi:10.1109/TNB.2010.2043851.
[13] A. Kryshtafovych, O. Krysko, P. Daniluk, Z. Dmytriv, and K. Fidelis, "Protein structure prediction center in CASP8," Proteins, vol. 77, no. S9, pp. 5–9, 2009. doi:10.1002/prot.22517.
[14] "Google App Engine," [online, cited 2009-11-06].
[15] M. D. Pierro, "web2py web framework," [online, cited 2009-11-06].
[16] O. Laursen, "Flot - Javascript plotting library for jQuery," [online, cited 2011-02-18].
[17] M. Overdijk and G. Laforge, "Gaelyk - lightweight Groovy toolkit for Google App Engine," [online, cited 2009-11-06].
[18] D. Gront and A. Kolinski, "Utility library for structural bioinformatics," Bioinformatics, vol. 24, pp. 584–585, Feb. 2008. doi:10.1093/bioinformatics/btm627.
[19] A. Zemla, "LGA: a method for finding 3D similarities in protein structures," Nucl. Acids Res., vol. 31, no. 13, pp. 3370–3374, 2003. doi:10.1093/nar/gkg571.
[20] Y. Zhang and J. Skolnick, "Scoring function for automated assessment of protein structure template quality," Proteins: Structure, Function, and Bioinformatics, vol. 57, pp. 702–710, Jan. 2004. doi:10.1002/prot.20264.
[21] C. Hardin, M. P. Eastwood, M. Prentiss, Z. Luthey-Schulten, and P. G. Wolynes, "Folding funnels: The key to robust protein structure prediction," Journal of Computational Chemistry, vol. 23, no. 1, pp. 138–146, 2002. doi:10.1002/jcc.1162.
[22] S. Wu, J. Skolnick, and Y. Zhang, "Ab initio modeling of small proteins by iterative TASSER simulations," BMC Biol, vol. 5, p. 17, May 2007. doi:10.1186/1741-7007-5-17.
[23] R. Garcia and T. Nickels, "Uploadify - a multiple file upload plugin for jQuery," [online, cited 2009-11-06].
[24] C. Krintz, C. Bunch, N. Chohan, J. Chohan, N. Garg, M. Hubert, J. Kupferman, P. Lakhina, Y. Li, G. Mehta, N. Mostafa, Y. Nomura, and S. H. Park, "AppScale - open source implementation of the Google App Engine," [online, cited 2010-09-21].
[25] T. Rodaebel and F. Glanzner, "TyphoonAE - environment to run Google App Engine (Python) applications," [online, cited 2010-09-21].
[26] "Google App Engine project roadmap," [online, cited 2010-09-21].