Expanding the scope of statistical computing: Training statisticians to be software engineers

Alex Reinhart and Christopher R. Genovese

October 30, 2020

Abstract

Traditionally, statistical computing courses have taught the syntax of a particular programming language or specific statistical computation methods. Since the publication of Nolan and Temple Lang [2010], we have seen a greater emphasis on data wrangling, reproducible research, and visualization. This shift better prepares students for careers working with complex datasets and producing analyses for multiple audiences. But, we argue, statisticians are now often called upon to develop statistical software, not just analyses, such as R packages implementing new analysis methods or machine learning systems integrated into commercial products. This demands different skills.

We describe a graduate course that we developed to meet this need by focusing on four themes: programming practices; software design; important algorithms and data structures; and essential tools and methods. Through code review and revision, and a semester-long software project, students practice all the skills of software engineering. The course allows students to expand their understanding of computing as applied to statistical problems while building expertise in the kind of software development that is increasingly the province of the working statistician. We see this as a model for the future evolution of the computing curriculum in statistics and data science.

1 Introduction

When Nolan and Temple Lang [2010] wrote their seminal paper on the role of computing in statistics and statistics curricula, they noted the rapid change in the skills needed by practicing statisticians.
It would no longer be sufficient for statisticians to learn computing only as a collection of numerical methods or specialized statistical algorithms, such as Markov chain Monte Carlo or generating pseudo-random numbers. Statisticians now face large quantities of data, often in new forms like text or networks, and this data must be obtained (such as from Web services or databases), then managed, wrangled in complex ways, and visualized. They argued that arming students with a solid computational base will prepare them to adapt to the wide range of problems they will see on the job, and that these computational skills will also give them new ways to explore and understand the statistical concepts. They suggested syllabi and curricula that would advance understanding of these skills in both undergraduate and graduate programs [Nolan and Temple Lang, 2009].

This premise has only become more true in the intervening years. As the conversation shifts to “data science” and organizations apply statistical thinking to an ever wider range of problems, statisticians must use their computational skills to acquire data from disparate sources, integrate it into a useful form, conduct exploratory analyses and visualizations to understand the data’s full complexity, and only then use statistical procedures to draw conclusions. To ensure these conclusions are reproducible, statisticians must also use computational tools like knitr [Xie, 2015] and the command line to automate a pipeline of scripts, analyses and results.

In this paper, we argue that though these computational skills are important, for some statisticians they are only a fraction of what is now needed. Many statisticians now find themselves delivering not analyses (in the form of reports or presentations on some statistical analysis) but products that are used continually.
In academia, these products might be R packages implementing a newly developed statistical method, so that others can apply the same method to their own problems. In industry, these products could be new methods to detect fraud or improve advertising in a large online service, used continuously as new data arrives and new decisions have to be made. In either case, the product is often a large and complex piece of software with a codebase developed by a team over many months, and it’s never truly “done”: it must be maintained and updated as conditions change and new requirements are placed on it.

To build and maintain these products, statisticians need additional skills. Structuring a large and complex codebase so it can be easily understood requires principles from software engineering; writing code with a team requires version control systems and collaboration skills; applying new statistical methods to large and complicated data requires a firm understanding of algorithms and data structures so the resulting code will be efficient. And everything must be well-tested and debugged so colleagues, bosses, and users can have confidence in the results. These skills are less important for a one-off data analysis, but they are crucial for the tasks statisticians face as they put their expertise into practice as part of long-running systems and widely used products.

Beginning in 2015, we have developed a graduate-level course in statistical computing intended to teach software engineering skills. The course is now part of the required curriculum for both the Master’s in Statistical Practice and the Ph.D. in Statistics & Data Science at Carnegie Mellon University, serving roughly 40–50 students per year. Most of these students have prior statistical programming experience from undergraduate courses, and our course is their only required graduate-level statistical computing course.
The students have widely varied backgrounds, and though the master’s program emphasizes professional skills [see Greenhouse and Seltman, 2018] while the Ph.D. program emphasizes theoretical and applied research, the course goals are shared: to prepare students to build complex statistical software.

In this paper, we set out the skills we aim to teach and the strategies we use to teach them. Our pedagogy has evolved every year as we have discovered that the course pedagogy is inextricable from its content. For students to learn complex computational skills, we must give them regular practice with these skills and rapid, targeted feedback on their performance. We argue that these skills are becoming increasingly important for graduate-level statisticians and cannot be left to others to fill in. Computation has only grown in importance in the ten years since Nolan and Temple Lang issued their call to action, and we expect it will only become more important in the ten years to come.

2 Role of Computing in Statistics and Data Science

Before we discuss the statistical computing course we developed, it will be useful to briefly trace the evolution of computing’s role in statistics and data science.

Since roughly 2000, a major focus of work and teaching in statistical computing has been “Literate Statistical Practice”, which “encourages the construction of documentation for data management and statistical analysis as the code for it is produced” [Rossini, 2001]. Tools such as Sweave [Leisch, 2002] began to allow statisticians to embed the code producing their analysis inside the text of the analysis report, so that a single command would run the analysis, produce the results, format tables and figures, and typeset the report for distribution.
This had many practical advantages, making it easy to make small changes and then re-run analyses and reports from scratch, and has become even more important as reproducibility has become a major concern. Tools like knitr and R Markdown [Xie, 2015, Xie et al., 2018] have made reproducible reports easier to write and easier to use, contributing to their rapid spread.

These tools are now widely used in statistics education and in practice. Baumer et al. [2014], for example, use knitr in introductory statistics courses to “develop the basic capacity to undertake modern data analysis and communicate their results,” and Çetinkaya-Rundel and Rundel [2018] described its coordinated use in an undergraduate course sequence designed to develop statistical computing skills in students from the introductory level onward. In industry, Bion et al. [2018] describe the widespread adoption of knitr among the data science team at Airbnb, who use it routinely to share their analyses and business experiments with each other, with management, and even publicly as blog posts and academic publications.

The other main emphasis of statistical computing work has been on tools to make wrangling, restructuring, summarizing, and aggregating data much easier. There is a growing emphasis on “tidy data” [Wickham, 2014], and the R community has developed many new packages [e.g. the Tidyverse, Wickham et al., 2019] that make it easy for statisticians to express the operations they need to wrangle their data into the most convenient form. Other packages facilitate interactive visualizations or make it easy to present statistical results in written reports. Statistical computing curricula have adapted to include these tools and to give students authentic practice wrangling messy data.

But we should not narrow our focus too quickly.
Not every statistical task fits into the framework of “receive a question, wrangle the data, conduct an analysis, and write a report on the results.” No longer relegated to roles as consultants or analysts brought in to answer specific questions, statisticians and data scientists increasingly hold roles as integral parts of teams developing products and delivering services. They need “data acumen,” which includes facility with a much wider range of tools and the ability to collaborate with software engineering teams and other disciplines [National Academies of Sciences, Engineering, and Medicine, 2018, Chapter 2].

Bion et al. [2018] provided an insightful example of this shift. At Airbnb, a service allowing property owners to list spaces for short-term rentals, the data science team might build “a machine learning algorithm that takes into account a variety of points of information” to suggest a fair price for a host to charge for guests. But the outcome of this work is not a report to be submitted to management describing the results of their modeling efforts: after developing a prototype model, they “worked with engineers to bring the prototype into production,” where hosts now use its recommendations every day. That is, the final outcome was the deployment of a piece of statistical software, which now continually operates as a part of Airbnb’s core business.

We can also look beyond industry to see statistical software used for purposes other than writing analytical reports. Consider a Ph.D. student conducting theoretical work on new models for some complex type of data. This work may involve thousands of lines of code: code to simulate data with known parameters, code to estimate the model from data, code to run simulation studies that verify theoretical results, code to calculate diagnostics or measure goodness of fit, code to fit benchmark models and run comparisons, etc.
Much of the code forms a product that an ambitious graduate student might release as an R package submitted to CRAN or a Python package on PyPI, allowing other researchers to benefit from their theoretical labor and use the newly developed methods for their own practical purposes. And the wide availability of these statistical products has been a boon for the field, allowing new statistical methods to be quickly adopted in industry [Bion et al., 2018]. The broader impact of this statistical software ecosystem is hard to overstate.

The shift in statistical computing is noticeable not only in the work done by statisticians in academia but also in the jobs they take in industry. For example, of 91 total graduates of Carnegie Mellon University’s Bachelor’s in Statistics & Data Science program in 2018, 56 reported being employed at the time of a survey of their career outcomes, and of these, 14 (25%) reported a job title implying a software development role, such as “Software Engineer” [CMU Career & Professional Development Center, 2018]. It has been our experience that many industry roles titled “Data Scientist” or “Data Analyst” also heavily involve software development.

3 Course Content

In the next sections, we discuss what it would mean for a statistical computing curriculum to prepare students for these roles, and discuss a course we developed to do so. We focus on four themes: four sets of skills students must learn to effectively develop statistical products and not just statistical reports. These themes are covered in lectures, but it is also vital that the course give students repeated practice with all these skills, and the necessary assignments and pedagogy will be discussed in Section 4.

3.1 Four Themes for Statistical Product Development

3.1.1 Effective Programming Practices

Students must learn practices that make software more reliable, more usable, and easier to maintain.
Such practices include testing, code review, clear naming, and effective documentation.

Unit testing, for example, is often adopted in professional software development to ensure code is correct and defects are not accidentally introduced. A unit test isolates a specific “unit” of code, such as a function or class, and runs that unit with specific inputs, then verifies that the unit’s output matches the expected output. Unit tests are written using a package designed to organize test cases, run all tests automatically with a single command, and report summary results indicating which test cases failed and giving descriptions of the failure. The testthat package is widely used for R [Wickham, 2011], and similar packages are available for almost every common programming language.

Unit testing is an essential part of software engineering for several reasons. Most obviously, it helps ensure correctness of software. If each function or method has detailed test cases, and these test cases can easily be checked every time the code is changed, mistakes can be detected immediately. Software engineering research shows that while writing unit tests takes extra time, this time can in some settings be made up in the time saved fixing problems and debugging errors [Williams et al., 2003, Bissi et al., 2016]. There have been notable cases of errors in statistical and scientific software going undetected for years, even as the software was used routinely for scientific research, underscoring the importance of effective testing [Eklund et al., 2016]. Less obviously, dividing up complex tasks into simple pieces (so they can be easily tested) also encourages software to be composed of small, easily understood pieces, which is a key software design recommendation (see Section 3.1.2).

Code review is another essential programming practice.
In collaborative software projects, such as software developed by a team in a large company or an open-source package developed by a group of volunteers, collaborators often practice peer code review [Rigby and Bird, 2013, Sadowski et al., 2018]. Each proposed change to the software, such as a new feature or fix for a problem, is submitted for review by a coworker or collaborator. The peer gives line-by-line feedback on the code, enforcing project style guidelines, looking for flawed reasoning and bugs, and giving feedback so the code can be improved. Only after the proposed change has passed peer review is it merged into the product or package. Experiments have shown that code review detects bugs and improves software quality, often by encouraging code to be clearer and easier to maintain [Mäntylä and Lassenius, 2009, Beller et al., 2014]. Popular software collaboration platforms like GitHub and GitLab support code review through “pull requests” or “merge requests.”

We give students experience with code review in two ways. We first host an in-class activity in which students review real code written (by a course instructor) to solve a specific problem. Students are given a code review checklist to follow, encouraging them to look at specific features of the code and comment on them as part of their review. Later in the semester, students conduct in-class peer code review of their Challenge projects (see Section 4.3) using GitHub’s code review features.

3.1.2 Fundamental Principles of Software Design

Throughout the semester, we emphasize a few key principles of design. These include modularity and code organization, the way that the many features required of software are organized into files, functions, classes, scripts, and so on. Effective software design is a powerful means to manage complexity.
In a poor design, functions may become large and complicated, and interact with each other in complicated ways, so that changing one small part of the code’s behavior requires intricate surgery on many separate functions. In an effective design, functions are small and modular, and features are clearly separated so that changing behavior only requires changing a few specific functions that are clearly responsible for that behavior. Good design also facilitates code reuse and refinement.

This kind of design is not a major concern when writing a literate statistical report, which is mostly linear with a few helper functions. But when developing a software package that’s intended to be reusable, careful design is essential: a good design makes it easy to modify and extend the package, for example as a Ph.D. student explores new methods in a thesis, while a bad design can make changes excruciatingly difficult.

The semester-long Challenge project, described below in Section 4.3, is designed to give students practical design experience. Since the project requires students to build a complicated product over the entire semester, and later portions of the project require students to build on or modify earlier portions, they either experience the benefits of well-designed code or suffer the pain of modifying poorly designed code. The teaching assistants also provide extensive feedback on design, starting before students implement any features.

3.1.3 Important Algorithms, Data Structures, and Representations

In recent years, a large amount of statistical research has been focused on scaling statistical methods to enormous datasets without an extravagantly large computational budget. Commonly, statistical computing courses prepare students to work with large datasets by teaching them different tools.
SQL database systems, for example, are designed to efficiently query massive datasets that do not fit in memory, while software like Hadoop and Apache Spark is designed to distribute calculations across multiple servers that each have their own chunk of data. Students might also learn to use tools like Rcpp, which allows users to write the most computationally intensive parts of their R packages in efficient C++ code that can be easily called from within R [Eddelbuettel and Francois, 2011]. (Cython [Behnel et al., 2011] serves a similar role in the Python world.) And students are often exposed to R programming folk wisdom: use built-in functions whenever possible, avoid for loops in favor of vectorization, and perhaps use packages like data.table instead of native data frames.

But this misses the ways that careful software design can make code efficient and scalable. First, the designer must select an algorithm appropriate to the task at hand, meaning the designer must be familiar with general strategies for designing algorithms. For example, the divide-and-conquer strategy is to reduce a large problem into several smaller problems whose solutions can be combined to yield the overall solution; by doing this recursively, a complex problem can be reduced to many small and trivial problems. The divide-and-conquer strategy is widely used in computer science to produce algorithms that scale well to large datasets (for example, mergesort is a divide-and-conquer sorting algorithm), and it has been recently explored as a tool for implementing statistical methods on large datasets [Jordan, 2013]. Dynamic programming is another widely used strategy to break problems into smaller problems whose solutions can be combined efficiently; for example, the fused lasso can be expressed as a dynamic programming problem, leading to a linear-time algorithm [Johnson, 2013].
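To make the divide-and-conquer strategy concrete, here is a minimal sketch of mergesort in Python. It is an illustration only, not code from the course; the function names `merge_sort` and `merge` are our own choices for this example.

```python
def merge(left, right):
    """Combine two already-sorted lists into one sorted list."""
    merged = []
    i = j = 0
    # Repeatedly take the smaller front element of the two lists.
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # At most one of these slices is non-empty; append the leftovers.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def merge_sort(values):
    """Return a new sorted list, leaving the input unchanged."""
    # Base case: a list of length 0 or 1 is trivially sorted.
    if len(values) <= 1:
        return list(values)
    # Divide: split the problem into two half-sized subproblems.
    mid = len(values) // 2
    left = merge_sort(values[:mid])
    right = merge_sort(values[mid:])
    # Combine: merge the two sorted halves.
    return merge(left, right)
```

Each level of recursion halves the problem and the merge step is linear in the number of elements, which is where the familiar O(n log n) running time of mergesort comes from.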
Along with the appropriate algorithm, the designer must also select appropriate data structures to store the data needed for an algorithm in an efficient way. Students used to working in R for data analysis tend to think of data frames, lists, matrices, and vectors as the only available data structures, and often write algorithms that require repeatedly scanning through an entire dataset to find relevant elements, which scales poorly to large datasets. But data structures like hash tables (dictionaries), binary trees, stacks, and queues all have their uses in statistical algorithms. In statistics, for example, the 𝑘-d tree can store 𝑛 data points, each in 𝑘 dimensions, and can find all data points in specific intervals or ranges in 𝑂(log 𝑛) time, rather than requiring a loop through all 𝑛 data points [Bentley, 1975]. This can also be used to solve 𝑘-nearest-neighbor problems efficiently, and has been adapted to perform fast approximate kernel density estimation [Gray and Moore, 2003]. Other tree data structures are widely used by SQL databases to efficiently process queries with complex joins and WHERE clauses.

Our Statistical Computing course covers basic algorithmic strategies such as divide-and-conquer and dynamic programming, as well as basic data structures. We emphasize to students that selecting the appropriate algorithm and data structure can be much more important than the ordinary R performance tips. An algorithm that uses repeated (but vectorized) scans through an entire data frame or vector is intrinsically less efficient than one that uses a tree to do the same operation in 𝑂(log 𝑛) time, for example. In the course, various homework assignments pose simple problems that can be solved in an obvious but tremendously inefficient way as well as a less-obvious but efficient way using an appropriate algorithm and data structure.
(These can be challenging in R, which does not provide efficient data structures by default; for example, looking up an item by name in an R list requires an 𝑂(𝑛) scan through all entries, and base R does not provide flexible collection data structures [Barry, 2018].) Along with the Challenge projects, these assignments teach students that fast code often requires careful thinking about the organization and use of data.

3.1.4 Essential Tools and Methods

There are many ways to produce good statistical software, but there are several core tools that are almost universally useful. Such tools include editors, integrated development environments, version control systems, debuggers and profilers, databases (relational and otherwise), and the command line. This theme focuses on giving students substantive experience with those tools to give them a foundation for building good practices and habits going forward.

We build experience with such tools into the structure of the course, providing support for a range of quality tools while giving students as much flexibility as possible. For instance, we cover using SQL in class and let students interact with SQL databases through their favorite programming language in assignments and class activities. Similarly, while it is possible to work completely through graphical user interfaces (GUIs), we believe that command line tools can add value for practitioners and be a powerful tool in many circumstances. We show students how to use these tools and build a set of practices for the effective design and use of such tools.

Version control is a more challenging example. It is a critical tool for successfully developing large-scale software in collaboration with others, allowing team members to track the history of every source file.
It allows changes to be systematically recorded and reverted if necessary, and allows collaborators working independently to make changes to code without interfering with each other’s work. Version control software is now widely used by companies and by collaborative open-source software projects. The R system itself, for example, is developed using the Subversion version control system.

There are many version control systems available, with non-trivial differences in use and details. We focus particularly on Git, which is perhaps the most widely used modern version control system, particularly with the growth of web-based collaboration services such as GitHub, GitLab, and Bitbucket that enhance Git with online tools for filing bug reports, reviewing proposed changes to code, and tracking project timelines and milestones. Students who are familiar with Git will be prepared to work at organizations that use Git or similar systems, or to collaborate on any of the thousands of open source data science packages that organize their development with Git. Bryan [2018] has also persuasively argued that Git is valuable for managing the data, code, and figures involved in a literate statistical analysis, such as data analysis reports, further enhancing their reproducibility by making the history of changes visible.

Unfortunately, Git is not known for being user-friendly. Its primary interface is through the command line shell, and its documentation can be almost impenetrable to new users. We have found that simply teaching the concepts to students in a lecture is not sufficient; students need extensive practice using Git throughout the semester to begin to grasp its concepts. Hence students use Git and GitHub to submit all homework assignments and course projects, starting with an in-class tutorial during the first week.
Readers interested in using Git in their own courses may benefit from the experiences of Çetinkaya-Rundel and Rundel [2018] and Fiksel et al. [2019], who discuss how to use Git in statistics courses and describe common student experiences, many of which match what we have seen in our course.

3.2 Anti-Themes

It would perhaps be most accurate to say that our course teaches a problem-solving philosophy, encompassed in the four themes presented above, rather than simply a collection of tools suited for specific problems. This is reflected by several topics we choose not to cover in the course.

For example, our course does not teach a programming language. We assume that our students have already had some exposure to programming, such as in an undergraduate statistical computing course or through practical experience conducting data analyses, and so we do not spend class time covering syntax or programming constructs. We do not require students to use any specific programming language for their work, and examples in our lectures are often given in R, Python, C++, Clojure, Racket, and other languages. As the concepts, practices, and skills covered in the course are widely relevant, this design decision makes the course accessible to students from a range of programming backgrounds.

We believe an even bigger benefit of this approach is the perspective it offers. A focus on a single language tends to conflate approaches to problems with the way their solutions can be expressed in that language. Instead, we often show examples in multiple languages so that students can see both the commonalities that are conserved across most languages as well as some contrast across other possible design choices and idioms.
Students quickly find that, even excepting a few syntactic details, they can understand the approach taken in a wide range of languages and that this affects how they approach problems even in their chosen language. We encourage students to get some experience in a new language, even if only on simple problems. Some students use this opportunity to explore languages they expect to use in practice (such as Python, or C++ for use with Rcpp), while others explore more widely and pick up functional or strongly typed languages.

Similarly, the Statistical Computing course does not cover specific packages, such as the Tidyverse [Wickham et al., 2019] or tools to obtain and wrangle data (such as Web scraping systems). Such tools are important in practice, but we feel it is more important for the course to cover fundamental computing concepts that will enable students to effectively use whatever tools may appear.

4 Pedagogy

One might suspect that a computing course emphasizing concepts without teaching any specific programming languages or tools (as ours does) can’t teach the practical skills students need. But in a course intended to teach students a complex skill, such as engineering statistical software, the only way for students to learn the skill, and not just the prerequisite knowledge for that skill, is regular practice with targeted feedback. Regular practice gives students the opportunity to practice the skills we teach, while targeted feedback ensures they learn those skills and learn from their mistakes. Hence the content of the Statistical Computing course cannot be separated from its pedagogy.

In this section, we describe a course design that ensures students gain regular hands-on practice and detailed feedback, and the practical considerations that went into it. Many of these pedagogical features were developed through experience with each iteration of the course, and so the course has changed significantly over time; these changes are summarized in Table 1 and discussed in the subsections that follow.

Semester      Changes
Spring 2015   Pilot version (half-semester)
Fall 2015     Challenge project introduced (one short project)
Fall 2016     Pull request & revision system; Master’s students join; 2 Challenge projects
Fall 2017     Various small content & activity improvements
Fall 2018     Switch to one four-part challenge
Fall 2019     Recommended homework schedule provided
Fall 2020     Problem bank rotation; Master’s students in separate course

Table 1: A summary of revisions and changes made to the Statistical Computing course during each iteration, as discussed in Section 4. The course structure has undergone many adjustments in response to experience.

4.1 Active Learning

In-class active learning has been repeatedly shown to improve student learning in a variety of STEM fields [Freeman et al., 2014]. Most of our course lectures incorporate active learning activities in various forms. For example, early in the semester we cover unit testing; we have found that students often struggle to think of test cases for code they write, so a large portion of the unit-testing class is spent having the students work in small groups to think of test cases for a few example functions.

Our experience has been that much of the student learning in a lecture seems to come from these activities. We frequently discover that after spending 30 minutes lecturing on a particular topic and feeling that the lecture is going well, an in-class activity reveals that some students are still deeply confused and have misinterpreted much of the lecture. Without these activities, the confusion could only be detected (and corrected) much later.
For key concepts that all students must master to succeed in the course, we have gradually shifted from long lectures to in-class group activities that students can later turn in individually for homework credit, ensuring that all students practice the necessary skills before completing other assignments or the Challenge project.

4.2 Homework Problem Bank

Because our students have varied levels of programming experience and have varied goals for the course, we felt that an ordinary homework assignment structure, where each student completes the same assignments, would be inadequate. Some students may already be familiar with certain topics and require little practice, while others may be most interested in specific topics they expect to use in their research or future job and desire specific practice for those topics.

To allow students to choose assignments that suit their needs, we developed a problem bank containing over 70 programming problems, categorized by topic and difficulty level. Some problems have direct connections to statistics, while others simply illustrate programming principles. Example assignments include:

• Implement a kernel smoothing procedure, but allow the user to pass in any kernel function and metric of their choice, not limited by any pre-specified list of kernels and metrics built in to the code.
• Implement graph search algorithms to solve mazes.
• Implement simple text tokenization to produce bag-of-words vectors for documents, then explore different distance metrics between documents of different types.

To give students constant practice with the themes discussed in Section 3.1, students are required to write unit tests for every homework assignment, and must submit them for review using Git. An automated system runs the unit tests submitted with each homework assignment and verifies that all tests pass.
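To make the first example assignment concrete, the following is a minimal sketch of what a solution with accompanying unit tests might look like. It is not taken from the course materials: it assumes a Nadaraya-Watson style weighted average, and the names `kernel_smooth`, `gaussian_kernel`, `euclidean`, `boxcar`, and `manhattan` are hypothetical. The point it illustrates is that the kernel and the metric are ordinary function arguments, so a caller can supply any pair without modifying the smoother.

```python
import math
from typing import Callable, Sequence

Point = Sequence[float]
Kernel = Callable[[float], float]
Metric = Callable[[Point, Point], float]


def gaussian_kernel(u: float) -> float:
    """Gaussian kernel; the normalizing constant cancels in the weighted average."""
    return math.exp(-0.5 * u * u)


def euclidean(a: Point, b: Point) -> float:
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def kernel_smooth(x0: Point, xs: Sequence[Point], ys: Sequence[float],
                  bandwidth: float,
                  kernel: Kernel = gaussian_kernel,
                  metric: Metric = euclidean) -> float:
    """Estimate the response at x0 as a kernel-weighted average of ys.

    The kernel and metric are supplied by the caller (hypothetical interface),
    so the smoother is not tied to any built-in list of choices.
    """
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    weights = [kernel(metric(x0, x) / bandwidth) for x in xs]
    total = sum(weights)
    if total == 0:
        raise ValueError("all kernel weights are zero at x0")
    return sum(w * y for w, y in zip(weights, ys)) / total


# User-supplied alternatives, defined outside the smoother.
def boxcar(u: float) -> float:
    """Uniform kernel: weight 1 within one bandwidth, 0 beyond it."""
    return 1.0 if abs(u) <= 1.0 else 0.0


def manhattan(a: Point, b: Point) -> float:
    """L1 (city-block) distance."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))


# Unit tests in the style a test runner such as pytest would discover.
def test_constant_response_is_reproduced():
    xs = [(0.0,), (1.0,), (2.0,)]
    assert math.isclose(kernel_smooth((0.5,), xs, [5.0, 5.0, 5.0], 1.0), 5.0)


def test_custom_kernel_and_metric():
    xs = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
    ys = [1.0, 3.0, 100.0]
    # L1 distances from the origin are 0, 2, 10; with bandwidth 2 the boxcar
    # kernel keeps the first two points and drops the third.
    est = kernel_smooth((0.0, 0.0), xs, ys, 2.0, kernel=boxcar, metric=manhattan)
    assert math.isclose(est, 2.0)
```

Passing functions as arguments is only one possible design; a solution in R could do the same with closures, and a reviewer might reasonably ask for further tests (for example, invalid bandwidths).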
Throughout the semester, assignments from the problem bank are posted as the relevant topics are covered. Students can select from all the posted assignments those they believe are most interesting or relevant to their needs, and complete the assignments at their own pace. They may use any programming language that at least one of the course instructors or TAs is able to read, though most students choose Python or R. Our grading system (see Section 4.4) simply requires students to satisfactorily complete a certain number of assignments by the end of the semester.

Students may be tempted to abuse the flexibility of the homework system by putting off assignments until near the end of the semester. To encourage effective time management, we have used two different strategies. Our first approach set a schedule by which students are expected to complete certain numbers of homework points (see Section 4.4); our second approach provides a rotating selection of assignments and retires assignments from the problem bank after 2–3 weeks, so students must complete an assignment quickly before it becomes unavailable. This also ensures that in a given week, the TAs need only grade a few different types of assignment, allowing them to grade more efficiently.

4.3 Challenge Project

The homework problem bank allowed students to gain practice in many topics covered in the course, but small homework assignments do not cover a key learning goal of the course: learning effective strategies to design and develop large-scale software, with all the complexity it entails, over the course of months or years.

To achieve this, the course includes a semester-long Challenge Project. Early in the semester, students choose between several Challenges on varying topics, and work on their chosen project for the rest of the semester.
For example, one Challenge asks students to implement an algorithm to build and prune classification trees, then use this code to build random forest classifiers [Breiman, 2001]. They then extend this code to build classification trees using data stored in a SQL database without loading this data into memory, in principle allowing the construction of trees for very large datasets. Finally, they scrape abstracts from the arXiv preprint server and use their random forest code to try to classify abstracts by subject category using features extracted from the abstracts.

Initially the Challenge projects were designed to take several weeks and were due in one unit. But since Fall 2018, the Challenge projects have been broken into four parts, due regularly throughout the semester. The entire project takes roughly three months. This allows the projects to be more detailed and ambitious, but also allows crucial scaffolding. In the first part, students consider the design of their code but do not actually implement it. Instead, they write function signatures but leave the bodies of the functions empty. The only code required to be submitted is extensive unit tests demonstrating what the code should do, encouraging students to think more deeply about their design before plunging in to implementation. The subsequent Challenge parts ask students to successively add features, following the requirements given in the assignment.

Besides the classification tree Challenge, other topics include applying the isolation forest method for anomaly detection [Liu et al., 2012] to videos, implementing fast data structures and algorithms for autocompletion [Wayne, 2016], and using audio fingerprinting to match short snippets of audio to a database of recorded music [Li-chun Wang, 2003].
All the Challenges are designed to produce a working piece of software that is usable in a relevant context, e.g., a package or library, a web or mobile app, or a command-line tool. All the Challenges are also designed to integrate multiple skills: students must select appropriate data structures and algorithms for their code to work efficiently, while using software design principles to keep it simple and easy to maintain.

4.4 Specification-Based Grading and Revision

During the initial iteration of the course, homework grading was fairly conventional: the teaching assistant graded code submissions using a simple rubric, assigning a point value to each rubric category. However, we quickly found this system to be ineffective, as students were not reviewing the feedback or using it to improve their future submissions.

Beginning in Fall 2016, we switched to a new system. Students select assignments from the problem bank and submit their code through GitHub as pull requests. The teaching assistants then give detailed line-by-line feedback on the pull requests using GitHub's code review features. The reviews point out bugs, critique difficult-to-read or poorly formatted code, suggest more appropriate algorithms or data structures, request additional unit test cases, and note design choices that make the code difficult to reuse or modify. Crucially, there are only two possible outcomes of review: the assignment can be marked "Mastered," indicating the student has successfully solved the problem, has used appropriate algorithms and data structures, and has written the code with good style and with unit tests that verify its correctness; or, if those criteria are not met, the assignment is marked "Changes requested" and the student is asked to revise it according to the feedback. Once revisions are complete, the student submits them for another review.
This system allows us to hold assignments to a very high standard. We expect that a large fraction of submissions will be revised. (We cut the required number of homework assignments in half at the same time as introducing the revision system; student workload has not appreciably dropped, showing that students are spending much more time on each assignment.) The revision process gives students practice with a constellation of skills that are often neglected in instruction and ensures they master the practical details of the concepts covered in the course. One could even consider the code reviews to be personalized tutoring provided by the teaching assistants, complementing the lectures and activities led by the instructors. This tutoring is what allows the course to cover topics at a high level and expect students to learn to implement them in their chosen programming languages.

A similar revision system is used for the Challenge projects. Each part of the Challenge is submitted to the teaching assistants for review as it is completed, and the student must make satisfactory revisions before they can submit the next part of the Challenge. Once all parts of the Challenge are complete and meet the requirements, it can be graded either Mastered or Sophisticated. Sophisticated submissions are those that demonstrate exceptional software engineering skill, by being well-designed, clearly written, thoroughly tested, unusually flexible and modular, and incorporating apt choices of methods/algorithms and data organization. Earning a Sophisticated grade on the Challenge increases the student's course grade, as discussed below in Section 4.5.

Prior research on mastery learning systems suggests strategies like this can improve student learning [Kulik et al., 1990], though the additional flexibility in our system distinguishes it from the more widely used mastery grading systems.
4.5 Grading System

The course structure poses challenges for assigning final course grades. As described in Section 4.2, students select homework assignments from a bank of possible problems. Students can complete assignments in any order, and there are no fixed deadlines for submitting individual assignments, nor are there points to be averaged to give a final grade.

We base grades on the number of Mastered assignments. A simple table in the course syllabus specifies how many assignments must be Mastered to achieve a certain grade. The Challenge project is required, but achieving a Sophisticated on the Challenge can also move the final course grade up one grade level.

To account for the fact that individual homework assignments may involve different degrees of difficulty, we assigned each homework assignment a certain number of "points." Typical assignments were 2 points, but difficult assignments were 3 points and trivial assignments 1 point. There are still only two possible outcomes: Mastered, meaning the student receives all the points, or revision, meaning the student does not yet receive any points for the assignment. There is no partial credit for submissions. The grade table is then based on the number of points Mastered, rather than the number of assignments, and accounts for assignment difficulty, preventing students from simply choosing the easiest assignments to complete.

We found that this grading system has several advantages. It is noticeably simpler than a normal points-based system to grade and administer, reducing workload on the TAs. It reduces uncertainty for students, who know that if they revise their submissions as instructed, they will obtain a certain number of points, and these points translate into grades. There is no concern over a final exam that heavily affects final grades (there is no final exam), and students know exactly how much work they must do for a certain grade.
It also gives the students the flexibility to explore the problem bank to improve their skills and gives them incentive to tackle some of the more challenging problems.

5 Conclusion

The trends that motivated Nolan and Temple Lang's call for a new focus on computing in statistics curricula have only accelerated. The scope and complexity of computing tasks expected of statisticians and data scientists require not only a detailed knowledge of statistical methods and numerical approaches but also skills related to data management, collaboration, and software engineering. Our Statistical Computing course is designed to give students a firm foundation, and authentic practice, in these skills. It is intended to serve as a base on which their programming experience can be built throughout their graduate career, and beyond. Novel features of our course include emphasis on the practice of software design, our multi-path problem bank, our grading system, integrated code review, regular revision, and a language-agnostic approach. The course has been successful in both our Ph.D. and Master's programs. Implementation, particularly at scale, is a continuing challenge, and we will continue to develop and refine the course.

Statistical computing is a broad topic, and students come with varied backgrounds and downstream needs. There are many reasonable approaches to teaching students the computing skills they will need in their careers. We believe, however, that working statisticians in industry and academia face increasingly stringent demands on the capabilities, usability, and maintenance of the software they produce, and that literate report-writing is only one component of the many computational skills a successful statistician will need. As the field's computing curricula continue to evolve, we believe that this reality needs to be faced head on.
5.1 Student Feedback

Our university conducts anonymous course evaluation surveys at the end of each semester; student participation is voluntary and response rates can vary widely. Nonetheless, the quantitative data and comments from students can sometimes provide useful information about how a course is being received.

According to the survey results, students in our statistical computing course report working roughly 11 hours per week on the course, which is above the intended average of 9 for a course worth its number of credits. Student comments attributed this to the fast pace of the course: for example, one student wrote that "As the class was designed covers a lot of topics that would take a couple semesters in normal C.S. courses, this course is definitely conceptually difficult and has a quite heavy workload." We do not think this is an unfair characterization, particularly for students with less prior programming experience, and continue to adjust the curriculum and pace based on student feedback.

Nonetheless, students enjoyed the pedagogy of the course and its flexibility. One student noted that "My favorite parts were the interactive parts;" the interactive activities discussed in Section 4.1 "helped me feel more engaged and helped me understand the problems better." Similarly, one student noted that "I could pick and choose easier and harder assignments, and get to explore new areas that interested me without being overwhelmed and stressed out constantly."

Though this flexibility was appreciated, it also has its drawbacks, as noted by the student who complained that "The homework system also really opened my eyes to how bad I am with time management." With no homework deadlines during the semester, some students experienced a mad rush to get the required number of assignments completed in the last few weeks.
5.2 Future Improvements

The topics we emphasize have changed from year to year as our understanding of student needs has changed, and the prior skills of each student cohort have varied significantly from year to year. Also, the unconventional course structure, while giving students significant freedom to explore their interests, has required a great deal of experimentation to improve, and likely will continue to change each year as we learn what structure best teaches our intended skills.

Several challenges remain to be solved. Git has proven to be a major obstacle to students; they must use it to submit each homework assignment on GitHub for review by the TAs, but students who make mistakes often attempt to fix them with ad-hoc solutions found online, leading to tangled Git histories that must be carefully untangled by the instructors or TAs before assignments can be graded. The flexible homework system can sometimes be too flexible, and without formal deadlines, students can procrastinate and get into difficult situations, or skip assignments selected as in-class activities and miss important skills needed for the Challenge project. (This has been a bigger issue for Master's students than for Ph.D. students.) The demands on course TAs to give high-quality feedback on assignments while also holding regular office hours can be stressful, requiring skilled TAs and large time commitments. We are working to streamline the course, automate some aspects of homework submission and review, and improve the student experience.

5.3 Implementing a Similar Course

For those interested in teaching the core themes we describe in Section 3.1, comprehensive lecture notes are available at our course website, https://36-750.github.io/.
(The website includes more than one semester worth of material: it includes notes on every topic that has been taught in the course, even as the selection of topics has changed from year to year.) The notes include in-class active learning activities, example programs, and notes that were used during lectures. The homework problem bank (Section 4.2), including solutions to some problems, is kept privately by the authors and is available to instructors on request.

But so far, we have left one key question open: Who can effectively teach a statistical computing course on the topics we describe? The four core themes require faculty with experience designing and implementing complex software; while an introductory R programming class simply requires knowledge of syntax and some basic principles, our themes include principles of algorithms, data structures, and software design. To give effective feedback, the course teaching assistants must also be experienced programmers who can recognize inefficient algorithms or unnecessarily complex designs. These constraints limit who can teach a course covering the skills we feel are most important, at least until such teaching becomes more widespread and faculty can be expected to have these skills. It may be practical, however, to co-teach the class. An instructor experienced in statistics and data science could cover those topics, while an instructor experienced in software engineering, perhaps from another department, provides the core computing content.

This raises a question: Why not have students take a computer science or software engineering course from another department?
While a fair portion of the material we emphasize (including algorithms, data structures, testing, software design, and wide-ranging assignments) might seem more naturally obtained from a Computer Science department, we have found many reasons to prefer that material within a Statistics curriculum. First, we explicitly address these themes, with significant class time; these skills tend to be threaded more implicitly throughout a typical computer science curriculum. Second, we can focus the practice of our target skills with context and examples that are meaningful to Statistics and Data Science students. Third, having a single foundation course early in the Statistics graduate curriculum has been a significant downstream productivity enhancer for our students, who soon use the skills in their other courses and projects. Finally, we know of no other course, in Computer Science or elsewhere, that achieves our target balance on our themes and skill development.

Acknowledgments

We thank Peter Freeman for contributions to the course design and content during its first iteration. We are indebted to our excellent teaching assistants for their outstanding work in the class: Philipp Burckhardt, Niccolò Dalmasso, Sangwon Hyun, Nicolás Kim, Francis Kovacs, Taylor Pospisil, and Shamindra Shrotriya. Jerzy Wieczorek provided helpful insight on our mastery grading system. We thank the peer reviewers and guest editor for many suggestions that improved the manuscript.

References

Timothy Barry. Collections in R: Review and proposal. The R Journal, 10(1):455–471, 2018. doi: 10.32614/RJ-2018-037.

Benjamin S. Baumer, Mine Çetinkaya-Rundel, Andrew Bray, Linda Loi, and Nicholas J. Horton. R Markdown: Integrating a reproducible analysis tool into introductory statistics. Technology Innovations in Statistics Education, 8(1), 2014. URL https://escholarship.org/uc/item/90b2f5xh.

S. Behnel, R. Bradshaw, C. Citro, L. Dalcin, D.S. Seljebotn, and K. Smith. Cython: The best of both worlds. Computing in Science Engineering, 13(2):31–39, 2011. ISSN 1521-9615. doi: 10.1109/MCSE.2010.118.

Moritz Beller, Alberto Bacchelli, Andy Zaidman, and Elmar Juergens. Modern code reviews in open-source projects: Which problems do they fix? In Proceedings of the 11th Working Conference on Mining Software Repositories, pages 202–211, 2014. doi: 10.1145/2597073.2597082.

Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509–517, 1975. doi: 10.1145/361002.361007.

Ricardo Bion, Robert Chang, and Jason Goodman. How R helps Airbnb make the most of its data. The American Statistician, 72(1):46–52, 2018. doi: 10.1080/00031305.2017.1392362.

Wilson Bissi, Adolfo Gustavo Serra Seca Neto, and Maria Claudia Figueiredo Pereira Emer. The effects of test driven development on internal quality, external quality and productivity: A systematic review. Information and Software Technology, 74:45–54, 2016. doi: 10.1016/j.infsof.2016.02.004.

Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001. doi: 10.1023/A:1010933404324.

Jennifer Bryan. Excuse me, do you have a moment to talk about version control? The American Statistician, 72(1):20–27, 2018. doi: 10.1080/00031305.2017.1399928.

Mine Çetinkaya-Rundel and Colin Rundel. Infrastructure and tools for teaching computing throughout the statistical curriculum. The American Statistician, 72(1):58–65, 2018. doi: 10.1080/00031305.2017.1397549.

CMU Career & Professional Development Center. First destination outcomes: Dietrich College Statistics & Data Science, bachelor's, 2018. URL https://www.cmu.edu/career/documents/2018_one_pagers/dc/Bachelors%20Stats.pdf.

Dirk Eddelbuettel and Romain Francois. Rcpp: Seamless R and C++ integration. Journal of Statistical Software, 40(8), 2011. doi: 10.18637/jss.v040.i08.

Anders Eklund, Thomas E Nichols, and Hans Knutsson. Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates. Proceedings of the National Academy of Sciences, 113(28):7900–7905, 2016. doi: 10.1073/pnas.1602413113.

Jacob Fiksel, Leah R Jager, Johanna S Hardin, and Margaret A Taub. Using GitHub Classroom to teach statistics. Journal of Statistics Education, 27(2):110–119, 2019. doi: 10.1080/10691898.2019.1617089.

S. Freeman, S. L. Eddy, M. McDonough, M. K. Smith, N. Okoroafor, H. Jordt, and M. P. Wenderoth. Active learning increases student performance in science, engineering, and mathematics. Proceedings of the National Academy of Sciences, 111(23):8410–8415, May 2014. doi: 10.1073/pnas.1319030111.

Alexander G. Gray and Andrew W. Moore. Nonparametric density estimation: Toward computational tractability. In Proceedings of the 2003 SIAM International Conference on Data Mining, pages 203–211, 2003. doi: 10.1137/1.9781611972733.19.

Joel B Greenhouse and Howard J Seltman. On teaching statistical practice: From novice to expert. The American Statistician, 72(2):147–154, 2018. doi: 10.1080/00031305.2016.1270230.

Nicholas A Johnson. A dynamic programming algorithm for the fused lasso and L0-segmentation. Journal of Computational and Graphical Statistics, 22(2):246–260, 2013. doi: 10.1080/10618600.2012.681238.

Michael I Jordan. On statistics, computation and scalability. Bernoulli, 19(4):1378–1390, 2013. doi: 10.3150/12-BEJSP17.

Chen-Lin C Kulik, James A Kulik, and Robert L Bangert-Drowns. Effectiveness of mastery learning programs: A meta-analysis. Review of Educational Research, 60(2):265–299, 1990. doi: 10.3102/00346543060002265.

Friedrich Leisch. Sweave: Dynamic generation of statistical reports using literate data analysis. In Wolfgang Härdle and Bernd Rönz, editors, Compstat 2002 - Proceedings in Computational Statistics, pages 575–580. Physica Verlag, Heidelberg, 2002. ISBN 3-7908-1517-9.

Avery Li-chun Wang. An industrial-strength audio search algorithm. In Proceedings of the 4th International Conference on Music Information Retrieval, 2003.

Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation-based anomaly detection. ACM Transactions on Knowledge Discovery from Data, 6(1):3:1–3:39, March 2012. doi: 10.1145/2133360.2133363.

M. V. Mäntylä and C. Lassenius. What types of defects are really discovered in code reviews? IEEE Transactions on Software Engineering, 35(3):430–448, May 2009. ISSN 2326-3881. doi: 10.1109/TSE.2008.71.

National Academies of Sciences, Engineering, and Medicine. Data Science for Undergraduates: Opportunities and Options. The National Academies Press, Washington, DC, 2018. doi: 10.17226/25104.

Deborah Nolan and Duncan Temple Lang. Integrating computing into the statistics curricula, 2009. URL https://www.stat.berkeley.edu/~statcur/.

Deborah Nolan and Duncan Temple Lang. Computing in the statistics curricula. The American Statistician, 64(2):97–107, 2010. doi: 10.1198/tast.2010.09132.

Peter C Rigby and Christian Bird. Convergent contemporary software peer review practices. In Proceedings of the 9th Joint Meeting on Foundations of Software Engineering, pages 202–212, 2013. doi: 10.1145/2491411.2491444.

A J Rossini. Literate statistical practice. In K Hornik and F Leisch, editors, Proceedings of the 2nd International Workshop on Distributed Statistical Computing, 2001.

Caitlin Sadowski, Emma Söderberg, Luke Church, Michal Sipko, and Alberto Bacchelli. Modern code review: A case study at Google. In Proceedings of the 40th International Conference on Software Engineering: Software Engineering in Practice, pages 181–190, 2018. doi: 10.1145/3183519.3183525.

Kevin Wayne. Autocomplete-me. In SIGCSE Nifty Assignments, 2016. URL http://nifty.stanford.edu/2016/wayne-autocomplete-me/.

Hadley Wickham. testthat: Get started with testing. The R Journal, 3(1):5–10, 2011. URL https://journal.r-project.org/archive/2011-1/RJournal_2011-1_Wickham.pdf.

Hadley Wickham. Tidy data. Journal of Statistical Software, 59(10), 2014. doi: 10.18637/jss.v059.i10.

Hadley Wickham, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D'Agostino McGowan, Romain François, Garrett Grolemund, Alex Hayes, Lionel Henry, Jim Hester, Max Kuhn, Thomas Lin Pedersen, Evan Miller, Stephan Milton Bache, Kirill Müller, Jeroen Ooms, David Robinson, Dana Paige Seidel, Vitalie Spinu, Kohske Takahashi, Davis Vaughan, Claus Wilke, Kara Woo, and Hiroaki Yutani. Welcome to the tidyverse. Journal of Open Source Software, 4(43):1686, Nov 2019. doi: 10.21105/joss.01686.

L. Williams, E. M. Maximilien, and M. Vouk. Test-driven development as a defect-reduction practice. In 14th International Symposium on Software Reliability Engineering, pages 34–45, Nov 2003. doi: 10.1109/ISSRE.2003.1251029.

Yihui Xie. Dynamic Documents with R and knitr. Chapman and Hall/CRC, Boca Raton, Florida, 2nd edition, 2015. URL https://yihui.name/knitr/. ISBN 978-1498716963.

Yihui Xie, J.J. Allaire, and Garrett Grolemund. R Markdown: The Definitive Guide. Chapman and Hall/CRC, Boca Raton, Florida, 2018. URL https://bookdown.org/yihui/rmarkdown. ISBN 9781138359338.
