Scientific Workflow Systems for 21st Century e-Science, New Bottle or New Wine?

Invited Short Paper

1 Yong Zhao, 2 Ioan Raicu, 2,3,4 Ian Foster
1 Microsoft Corporation, Redmond, WA, USA
2 Department of Computer Science, University of Chicago, Chicago, IL, USA
3 Computation Institute, University of Chicago, Chicago, IL, USA
4 Math & Computer Science Division, Argonne National Laboratory, Argonne, IL, USA
yozha@microsoft.com, iraicu@cs.uchicago.edu, foster@mcs.anl.gov

Abstract

With the advances in e-Science and the growing complexity of scientific analyses, more and more scientists and researchers rely on workflow systems for process coordination, derivation automation, provenance tracking, and bookkeeping. While workflow systems have been in use for decades, it is unclear whether scientific workflows can, or even should, build on existing workflow technologies, or whether they require fundamentally new approaches. In this paper, we analyze the status and challenges of scientific workflows, investigate both existing technologies and emerging languages, platforms, and systems, and identify the key challenges that must be addressed by workflow systems for e-science in the 21st century.

1. Introduction

Scientific workflow has become increasingly popular in modern scientific computation, as more and more scientists and researchers rely on workflow systems to conduct their daily analysis and discovery. With technological advances in both scientific instrumentation and simulation, the volume of scientific data sets is growing exponentially each year. Such large data sizes, combined with the growing complexity of data analysis procedures and algorithms, have rendered traditional manual processing and exploration unfavorable compared with modern in silico processes automated by scientific workflow systems (SWFS).
While the term workflow means different things in different contexts, we find that in general SWFS are engaged in and applied to the following aspects of scientific computation: 1) describing complex scientific procedures, 2) automating data derivation processes, 3) high performance computing (HPC) to improve throughput and performance, and 4) provenance management and query. Workflows are not a new concept and have been around for decades. A number of coordination languages and systems developed in the 80s and 90s [1,7] share many common characteristics with workflow systems (i.e., they describe individual computation components, their ports and channels, and the data and event flow between them). They also coordinate the execution of the components, often on parallel computing resources. Furthermore, business process management systems have been developed and invested in for years; there are many mature commercial products and industry standards such as BPEL [2]. In the scientific community there are also many emerging systems for scientific programming and computation [5,22]. Before we jump into developing yet another workflow system, a fundamental question to ask is whether we can use existing technologies, or whether we should invent new languages and systems in order to achieve the four aspects mentioned earlier that are essential to scientific workflow systems. This paper identifies the challenges to workflow development in the context of scientific computation; we present an overview of some of the existing technologies and emerging systems, and discuss opportunities in addressing these challenges.

2. Multi-core processor architectures

Software development has enjoyed a free ride on performance gains as chipmakers continued to follow Moore's Law, doubling the number of transistors in the same minuscule space.
Little consideration was given to code parallelization, since it was not essential for the average computer user until recently, when single-core CPU performance growth stagnated and multi-core processors emerged on the market in 2005. Due to the limitations of further increasing processor clock frequency, hardware manufacturers started to physically reorganize chips into what we call the multi-core architecture [10], linking several microprocessor cores together on the same semiconductor die. Manufacturers including Intel, AMD, IBM, and Sun have released dual-core, quad-core, eight-core, and 64-threaded processors in the past few years [13,21]. Given that 128-threaded SMP systems are a reality today [21], it is reasonable to assume that 1024 CPU cores/threads or more per SMP system will be available in the next decade. The new multi-core architecture will force radical changes in software design and development. We are already seeing a significant increase in research interest in concurrency, parallelism, and multi-core software development. The number of multiprocessor research papers has increased sharply since 2001, surpassing the peak of all previous years [10]. Concurrency is one of the next big challenges in how we write software, simply because our industry has been driven by requirements to write ever larger systems that solve ever more complicated problems and exploit the ever greater computing and storage resources available [18].

3. The data deluge challenge in science

Within the science domain, the data that needs to be processed generally grows faster than computational resources and their speed. The scientific community is facing an imminent flood of data expected from the next generation of experiments, simulations, sensors, and satellites.
Scientists are now attempting calculations requiring orders of magnitude more computing and communication than was possible only a few years ago. Moreover, in many currently planned and future experiments, they also plan to generate several orders of magnitude more data than has been collected in all of human history [9]. For instance, in the astronomy domain the Sloan Digital Sky Survey (http://www.sdss.org) has datasets that exceed 10 terabytes in size; they can reach 100 terabytes or even petabytes if we consider multiple surveys and the time dimension. In physics, the CMS detector being built to run at CERN's Large Hadron Collider (http://lhc.web.cern.ch/lhc) is expected to generate over a petabyte of data per year. In bioinformatics, the growth of DNA databases such as GenBank (http://www.psc.edu/general/software/packages/genbank/) and EMBL (European Molecular Biology Laboratory, http://www.embl.org) has followed an exponential trend, with a doubling time estimated at 9-12 months. To enable the storage and analysis of large quantities of data and to achieve rapid turnaround, data needs to be distributed over thousands to tens of thousands of compute nodes. In such circumstances, data locality is crucial to the successful and efficient use of large-scale distributed systems for data-intensive applications [19]. Scientific workflows are generally executed on shared infrastructures such as TeraGrid (http://www.teragrid.org), the Open Science Grid (http://www.opensciencegrid.org), and dedicated clusters, where data movement relies on shared file systems that are known bottlenecks for data-intensive operations. If data analysis workloads have locality of reference, then it is feasible to cache and replicate data at each individual compute node, as high initial data movement costs can be offset by many subsequent data operations performed on cached data [15].
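As a concrete sketch of the caching idea above, a data-aware dispatcher can simply prefer the node that already holds most of a task's input files. The node and file names below are hypothetical, and this greedy heuristic is an illustration rather than the policy of any particular system.

```python
# Minimal sketch of data-aware task dispatch (hypothetical names; not the
# scheduler of any particular SWFS). Each node caches a set of file names;
# a task is sent to the node that already holds the most of its inputs.

def pick_node(task_inputs, node_caches):
    """Return the node whose cache overlaps most with the task's inputs."""
    def overlap(node):
        return len(node_caches[node] & set(task_inputs))
    return max(node_caches, key=overlap)

def dispatch(task_inputs, node_caches):
    node = pick_node(task_inputs, node_caches)
    # After dispatch the node's cache holds the task's inputs, so later
    # tasks touching the same files are drawn back to the same node.
    node_caches[node] |= set(task_inputs)
    return node

caches = {"node-a": {"sky1.fits"}, "node-b": {"sky2.fits", "sky3.fits"}}
print(dispatch(["sky2.fits", "sky3.fits"], caches))  # node-b: both inputs cached
```

The payoff is exactly the amortization argued for in the text: the first access to a file pays the movement cost, and every subsequent task with overlapping inputs runs against the local copy.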
Modern scientific workflow systems need to make large-scale data management one of their primary objectives, and to ensure that data movement is minimized by intelligent data-aware scheduling, both among distributed computing sites (assuming that each site has a local-area-network shared storage infrastructure) and among compute nodes (assuming that data can be stored on compute nodes' local disk and/or memory).

4. Supercomputing vs. Grid Computing

Supercomputers had their golden age back in the 80s, when there were virtually no other choices for compute-intensive tasks. They were applied mostly to scientific modeling and simulation in disciplines such as high energy physics, earth science, biology, and mechanical engineering. Typical applications included weather forecasting, missile trajectory simulation, airplane wind tunnel simulation, and genomics. However, supercomputers are expensive and scarce resources to which only national laboratories, government agencies, and some universities have access; and the parallel architectures of supercomputers often dictate the use of special programming techniques to exploit their speed, such as special-purpose FORTRAN compilers, PVM, MPI, and OpenMP [9]. Over the last decade, we have observed processor speeds, storage capacity per drive, and network bandwidth increase by factors of 100 to 1000. As a consequence, cluster computing and Grid computing environments that leverage cheaper commodity computing and storage hardware have been actively adopted for scientific computation. Cluster computing usually involves homogeneous machines interconnected by a high-speed network with locally accessible storage in one administrative domain, whereas Grid computing focuses on distributed resource sharing and coordination across multiple "virtual organizations" that may span many geographically distributed administrative domains.
Grids can also be categorized into Computational Grids and Data Grids, where the former mostly tackle computation-intensive tasks and the latter target data-intensive sciences. With the introduction of multi-core architectures, the separation between Grid computing and supercomputing is becoming less clear. Many supercomputers are being built on multi-core chips with high-speed interconnects. The Cray XT5 system (http://www.cray.com/products/xt5/index.html) uses thousands of commodity Quad-Core AMD Opteron™ processors and has a unified Linux environment. The latest IBM BlueGene/P supercomputer (BG/P, http://www.research.ibm.com/bluegene/) has quad-core processors with a total of 160K cores, and supports a lightweight Linux kernel on the compute nodes, making it significantly more accessible to new applications [17]. Finally, a smaller system named SiCortex (http://www.sicortex.com/) is also worth mentioning; it boasts 6-core processors for a total of 5832 cores, and runs a standard Linux environment. Supercomputers (e.g., IBM BlueGene) have traditionally been designed and used for tightly coupled, massively parallel applications, typically implemented in MPI. They have not been a preferred platform for executing the loosely coupled applications that are typical in many scientific workflows. Grids have seen some success in the execution of tightly coupled parallel applications, but they have been the platform of choice for loosely coupled applications, mostly due to the flexibility and granularity of their resource management and the ease with which they execute single-processor jobs.
Work is underway within both the Falkon [14] and Condor [20] projects to enable the latest BG/P to efficiently support loosely coupled serial jobs without any modifications to the respective applications, hence enabling an entirely new class of applications that were never candidate use cases for the BlueGene/P supercomputer. Scalability and performance are top priorities for SWFS. To this end, it is necessary to leverage supercomputing resources as well as Grid computing infrastructures for large-scale parallel computations.

5. Existing and emerging workflow technologies

DAGMan (http://www.cs.wisc.edu/condor/dagman) and Pegasus [6] are two systems that are commonly referred to as workflow systems and have been widely applied in Grid environments. DAGMan provides a workflow engine that manages Condor jobs organized as directed acyclic graphs (DAGs), in which each edge corresponds to an explicit task precedence. Both systems focus on the scheduling and execution of long-running jobs. Taverna [12] is an open source workflow system particularly focused on bioinformatics applications and services; it is based on the XScufl (XML Simple Conceptual Unified Flow) language. Kepler [11] is a scientific workflow system that builds on the Ptolemy II system (http://ptolemy.eecs.berkeley.edu/ptolemyII/), a visual modeling tool written in Java. Triana [4] is a GUI-based workflow system for coordinating and executing a collection of services. All these systems have visual interfaces (also referred to as workbenches) that allow the graphical composition of workflows. While all of the existing SWFS possess great features and address many aspects of workflow specification, execution, and management, it is unrealistic to expect one system to cover all the bases.
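To make the DAG-with-precedence model concrete: DAGMan workflows are described in a small plain-text format. A hypothetical diamond-shaped DAG (the .sub file names are placeholders for Condor submit descriptions) might look like:

```
# diamond.dag: a hypothetical four-node diamond workflow
JOB A a.sub
JOB B b.sub
JOB C c.sub
JOB D d.sub
# B and C may run in parallel once A completes
PARENT A CHILD B C
# D waits for both B and C
PARENT B C CHILD D
# re-run D up to twice on failure
RETRY D 2
```

Each PARENT/CHILD line is an explicit edge in the DAG; DAGMan releases a job to Condor only when all of its parents have finished successfully.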
The Workflow Bus project [23] instead tries to leverage multiple existing workflow systems to complement each other, implementing aggregated functions and services. Finally, the evolution of workflows themselves (explorations) is vital in scientific analysis. VisTrails [3] captures the notion of an evolving dataflow and implements a history management mechanism to maintain versions of a dataflow, thus allowing a scientist to return to previous steps, apply a dataflow instance to different input data, explore the parameter space of the dataflow, and (while performing these steps) compare the associated visualization results.

In response to the pressing demands of scientific applications and the hunger for computing power, a few emerging languages and systems have tried to tackle these problems with unconventional approaches. MapReduce [5] is regarded as a power-leveler that solves complicated computation problems using brute-force computation power. It provides a very simple programming model and a powerful runtime system for the processing of large datasets. The programming model is based on just two key functions, "map" and "reduce," borrowed from functional languages. The MapReduce runtime system automatically partitions input data and schedules the execution of programs on a large cluster of commodity machines. The system is made fault tolerant by checking worker nodes periodically and reassigning failed jobs to other worker nodes. MapReduce has mostly been applied to document processing problems, such as distributed indexing, sorting, and clustering. The Fortress language (http://fortress.sunsource.net), recently released by Sun Microsystems, is a new programming language designed for HPC that aims to improve programmability and productivity in scientific computation.
The language has been designed from the ground up, supporting mathematical notation (in Unicode), physical units and dimensions, static type checking of multidimensional arrays and matrices, and rich functionality in libraries. It supports transactions, specification of locality, and implicit parallel computation (e.g., parallel for loops). Although Fortress in a strict sense is not a workflow language, and its adoption remains to be seen, it provides higher-level abstractions and functionality for building a parallel workflow language.

Microsoft Windows Workflow Foundation (WWF) [16] provides a generic framework for workflow development and execution. It is focused on integrating diverse components within an application, allowing a workflow to be deployed and managed as a native part of the application. The fundamental idea behind WWF is that each activity is modeled as a resumable program statement and invoked asynchronously; a program can thus be compared to a bookmark, which can be frozen in action, serialized into persistent storage, and resumed an arbitrarily long time later. However, WWF is not a full-fledged workflow management system, in that it lacks the administration, monitoring, retry mechanisms, load balancing, etc. needed for a production environment.

Star-P (http://www.interactivesupercomputing.com) approaches the integration of scientific applications and HPC via language extension, allowing scientists to work in familiar programming environments such as MATLAB, Python, and R, with some parallel directives. Internally, the system can schedule the execution of parallel tasks on a computation cluster pre-configured with scientific calculation libraries.
The system has been applied to a wide variety of computation problems, but the performance improvement comes mostly from intra-application parallelization rather than inter-component coordination and management.

Swift [22] is an emerging system that bridges scientific workflows with parallel computing. It is a parallel programming tool for the rapid and reliable specification, execution, and management of large-scale science and engineering workflows. Swift takes a structured approach to workflow specification, scheduling, and execution. It consists of a simple scripting language called SwiftScript for concise specification of complex parallel computations based on dataset typing and iteration, with dynamic dataset mappings for accessing large-scale datasets represented in diverse data formats. The runtime system relies on the CoG Karajan workflow engine for efficient scheduling and load balancing, and it integrates the Falkon [14] lightweight task execution service for optimized task throughput and resource efficiency, delivered by a streamlined dispatcher, a dynamic resource provisioner, and a data diffusion mechanism that caches datasets in local disk or memory and dispatches tasks according to data locality.

6. Call for scientific workflow systems

Existing technologies and systems already address many of the fundamental issues in scientific workflow specification and management, and many of them have been successfully applied to various scientific applications across multiple science disciplines. However, modern multi-core architectures, parallel and distributed computing technologies, and the exponentially growing volume of scientific data are bound to change the landscape and evolution of scientific workflow systems.
As already manifested by the few emerging systems, the science community demands both specialized, domain-specific languages to improve productivity and efficiency in writing concurrent programs and coordination tools, and generic platforms and infrastructures for the execution and management of large-scale scientific applications, where scalability and performance are major concerns. High performance computing support has become an indispensable piece of such workflow languages and systems, as there is no other viable way around the large storage and computing problems emerging in every discipline of 21st century e-science, although the best approach to letting scientists leverage HPC technologies as transparently and efficiently as possible remains an open question. In the science domain, there is an increasing need for programming languages to expose parallelism, whether explicitly or implicitly, to specify concurrency within a component or across multiple independent components. There is a need for new parallel or workflow languages that adopt implicit parallelism, where data dependencies can be discovered by the compiler and independent tasks, on the order of hundreds of thousands, can be scheduled to run in cluster or Grid environments. Such systems could achieve improvements in both manageability and productivity. Scientific workflow systems aim to provide a simple, concise notation that allows easy parallelization and supports the composition of large numbers of parallel computations; therefore they may not need all the constructs and features of a full-fledged conventional language, and implicit parallelism is preferred to explicit parallelism specification, as the latter requires expertise and attention to the details of parallel programming, which may be difficult for end users.
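The implicit style argued for above can be sketched with ordinary futures, here using Python's concurrent.futures as a stand-in for a workflow runtime; analyze() and combine() are hypothetical application steps, not part of any system discussed here.

```python
# Sketch of implicit parallelism via futures. Tasks with no mutual data
# dependency run concurrently without any explicit threading code; a
# dependent task expresses its dependency simply by consuming results.
from concurrent.futures import ThreadPoolExecutor

def analyze(dataset):
    return sum(dataset)   # placeholder for a real analysis step

def combine(a, b):
    return a + b          # placeholder for a merge step

with ThreadPoolExecutor() as pool:
    # The two analyze() calls are independent, so the runtime is free
    # to execute them in parallel.
    fa = pool.submit(analyze, [1, 2, 3])
    fb = pool.submit(analyze, [4, 5, 6])
    # combine() depends on both; .result() encodes the data dependency
    # and blocks only as long as needed.
    total = combine(fa.result(), fb.result())

print(total)  # 21
```

The user writes what looks like sequential code; the parallelism falls out of the data-dependency structure, which is exactly the property a compiler for an implicitly parallel workflow language would exploit.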
But in the meantime, scientists sometimes do need more control in specifying how to distribute their applications and datasets. We are also in need of common generic infrastructures and platforms in the science domain for workflow administration, scheduling, execution, monitoring, provenance tracking, etc. While business process management has industry-agreed standards and steering committees, we have neither in the science domain, where people often reinvent the wheel by developing their in-house yet-another-SWFS, and there is no easy way to integrate the various workflow systems and specifications. We also argue that, in order to address all the important issues such as scalability, reliability, scheduling and monitoring, data management, collaboration, workflow provenance, and workflow evolution, one system cannot fit all needs. A structured infrastructure that separates the concerns of workflow specification, scheduling, execution, etc., yet is organized on top of components that specialize in one or more of these areas, would be more appropriate.

7. References

[1] Ahuja, S., Carriero, N., and Gelernter, D., "Linda and Friends", IEEE Computer 19 (8), 1986, pp. 26-34.
[2] Andrews, T., Curbera, F., Dholakia, H., Goland, Y., Klein, J., Leymann, F., Liu, K., Roller, D., Smith, D., Thatte, S., Trickovic, I., Weerawarana, S.: Business Process Execution Language for Web Services, Version 1.1. Specification, BEA Systems, IBM Corp., Microsoft Corp., SAP AG, Siebel Systems, 2003.
[3] Callahan, S.P., Freire, J., Santos, E., Scheidegger, C.E., Silva, C.T., and Vo, H.T., Managing the Evolution of Dataflows with VisTrails. IEEE Workshop on Workflow and Data Flow for Scientific Applications (SciFlow), 2006.
[4] Churches, D., Gombas, G., Harrison, A., Maassen, J., Robinson, C., Shields, M., Taylor, I., Wang, I., Programming Scientific and Distributed Workflow with Triana Services.
Concurrency and Computation: Practice and Experience, Special Issue on Scientific Workflows, 2005.
[5] Dean, J. and Ghemawat, S., MapReduce: Simplified Data Processing on Large Clusters. OSDI, 2004.
[6] Deelman, E., Singh, G., Su, M.-H., Blythe, J., Gil, Y., Kesselman, C., Mehta, G., Vahi, K., Berriman, G.B., Good, J., Laity, A., Jacob, J.C., and Katz, D.S., Pegasus: A Framework for Mapping Complex Scientific Workflows onto Distributed Systems. Scientific Programming, 13 (3), pp. 219-237.
[7] Foster, I., "Compositional Parallel Programming Languages", ACM Transactions on Programming Languages and Systems 18 (4), 1996, pp. 454-476.
[8] Gray, J., "Distributed Computing Economics", Technical Report MSR-TR-2003-24, Microsoft Research, Microsoft Corporation, 2003.
[9] Hey, T., Trefethen, A., The Data Deluge: An e-Science Perspective. In Grid Computing: Making the Global Infrastructure a Reality, Wiley, 2003.
[10] Hill, M., and Marty, M., Amdahl's Law in the Multicore Era. The 14th International Symposium on High-Performance Computer Architecture, 2008.
[11] Ludäscher, B., Altintas, I., Berkley, C., Higgins, D., Jaeger-Frank, E., Jones, M., Lee, E., Tao, J., Zhao, Y., Scientific Workflow Management and the Kepler System. Concurrency and Computation: Practice & Experience, 2005.
[12] Oinn, T., Addis, M., Ferris, J., Marvin, D., Senger, M., Greenwood, M., Carver, T., Glover, K., Pocock, M.R., Wipat, A., Li, P., Taverna: A Tool for the Composition and Enactment of Bioinformatics Workflows. Bioinformatics Journal, 20(17):3045-3054, 2004.
[13] The Potential of the Cell Processor for Scientific Computing. Computational Research Division, Lawrence Berkeley National Laboratory, 2007.
[14] Raicu, I., Zhao, Y., Dumitrescu, C., Foster, I., Wilde, M., "Falkon: a Fast and Lightweight Task Execution Framework", IEEE/ACM Supercomputing, 2007.
[15] Raicu, I., Zhao, Y., Foster, I., Szalay, A.,
"Accelerating Large-scale Data Exploration through Data Diffusion", ACM/IEEE Workshop on Data-Aware Distributed Computing, 2008.
[16] Shukla, D., Schmidt, R., Essential Windows Workflow Foundation (Microsoft .NET Development Series), Addison-Wesley Professional, 2006.
[17] Stevens, R., The LLNL/ANL/IBM Collaboration to Develop BG/P and BG/Q, DOE ASCAC Report, 2006.
[18] Sutter, H., The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software, Dr. Dobb's Journal 30(3), March 2005.
[19] Szalay, A., Bunn, A., Gray, J., Foster, I., Raicu, I., "The Importance of Data Locality in Distributed Computing Applications", NSF Workflow Workshop, 2006.
[20] Thain, D., Tannenbaum, T., Livny, M., "Distributed Computing in Practice: The Condor Experience", Concurrency and Computation: Practice and Experience, Vol. 17, No. 2-4, pp. 323-356, February-April 2005.
[21] UltraSPARC® T2 Processor, The World's First True System on a Chip, Sun Microsystems Datasheet, 2008.
[22] Zhao, Y., Hategan, M., Clifford, B., Foster, I., von Laszewski, G., Raicu, I., Stef-Praun, T., Wilde, M., Swift: Fast, Reliable, Loosely Coupled Parallel Computation, IEEE Workshop on Scientific Workflows (SWF07), Collocated with SCC 2007.
[23] Zhao, Z., Belloum, A., de Laat, C., Adriaans, P., Hertzberger, R.: Distributed Execution of Aggregated Multi-Domain Workflows Using an Agent Framework. IEEE Workshop on Scientific Workflows, 2007.
