Think-aloud interviews: A tool for exploring student statistical reasoning
Alex Reinhart 1, Ciaran Evans 2, Amanda Luby 3, Josue Orellana 4, Mikaela Meyer 1, Jerzy Wieczorek 5, Peter Elliott 1, Philipp Burckhardt 1, and Rebecca Nugent 1

1 Department of Statistics & Data Science, Carnegie Mellon University
2 Department of Mathematics and Statistics, Wake Forest University
3 Department of Mathematics & Statistics, Swarthmore College
4 Center for the Neural Basis of Cognition and Machine Learning Department, Carnegie Mellon University
5 Department of Statistics, Colby College

April 5, 2022

Abstract

Think-aloud interviews have been a valuable but underused tool in statistics education research. Think-alouds, in which students narrate their reasoning in real time while solving problems, differ in important ways from other types of cognitive interviews and related education research methods. Beyond the uses already found in the statistics literature—mostly validating the wording of statistical concept inventory questions and studying student misconceptions—we suggest other possible use cases for think-alouds and summarize best-practice guidelines for designing think-aloud interview studies. Using examples from our own experiences studying the local student body for our introductory statistics courses, we illustrate how research goals should inform study-design decisions and what kinds of insights think-alouds can provide. We hope that our overview of think-alouds encourages more statistics educators and researchers to begin using this method.

1 Introduction

Think-aloud interviews, in which interview subjects solve problems while narrating their thinking aloud, provide a valuable statistics education research tool that can be used to study student misconceptions, improve assessments and course materials, and inform teaching.
In contrast to written assessment questions or traditional interviews, think-alouds involve the subject describing their thoughts in real time without interviewer feedback, rather than providing explanations after the fact or in dialogue with the interviewer (Ericsson and Simon, 1998; Adams and Wieman, 2011). They differ in important ways from other types of cognitive interviews (Leighton, 2017), such as the task-based interviews in Woodard and Lee (2021), where the interviewer may probe the interviewee about their steps or thought process as they work. Think-aloud interviews better capture how interviewees think about the problem on their own, and give a clearer picture of the reasoning process in real time. While think-alouds and other cognitive interviews are widely used in education research (Bowen, 1994; Kaczmarczyk et al., 2010; Adams and Wieman, 2011; Karpierz and Wolfman, 2014; Deane et al., 2014; Taylor et al., 2020), their use in statistics education appears to be mostly concentrated on developing concept inventories (Lane-Getaz, 2007; Park, 2012; Ziegler, 2014; Sabbag, 2016) and several studies on student misconceptions (Konold, 1989; Williams, 1999; Lovett, 2001). Furthermore, existing work in statistics does not provide extensive guidance on the think-aloud process to inform researchers interested in conducting their own interviews. Our goal in this work is to advocate for think-aloud interviews in statistics education by describing details of the think-aloud process and including best practices for how interview protocols may vary for different research goals, so that interested readers have a starting point to conduct their own interviews. In order to illustrate how the research context should drive design decisions and the interpretation of results, we use one of our own think-aloud interview studies as a concrete running example.
In this informal study, appraising the accuracy of our own beliefs about student understanding in the introductory statistics courses we have taught at Carnegie Mellon, we conducted think-aloud interviews with approximately 30 students over several semesters. Questions covered a selection of introductory topics where we believed we knew the most common misconceptions held by our students. Findings from the study showed us several areas where we were mistaken, with clear implications for how we might revise our teaching. Furthermore, data from our early think-alouds led us to revise several ambiguous tasks to improve our later think-alouds. The particular results we present here are not meant to generalize beyond our student population. Rather, we include them in the hope that they inspire other statistics educators and researchers to see the value of using think-alouds in their own work. In Section 2, we describe the motivation for think-aloud interviews, and contrast them with other tools like concept inventories. Since think-alouds can be used for a variety of research goals in statistics education, we comment on how protocols may need to change to support different types of research. In Section 3 we summarize best-practice guidelines for think-aloud interviews (Leighton, 2017), and we describe our own think-aloud protocol in order to illustrate how these guidelines may be applied to tailor a study design to particular research goals. In Section 4, we share several findings from our think-aloud interviews to demonstrate how we interpreted these results in light of our own research goals, focused on our students and our own teaching. Through these case studies, we emphasize how think-alouds provided new information about our students which we had not observed in traditional interactions such as class discussions, office hours, and written assignments.
Although some of the misconceptions we observed had already been discussed in the statistics education literature, we had not previously been aware of them in our own students.

2 Background on think-aloud interviews

2.1 Think-alouds vs. related methods

There is a spectrum of ways that instructors learn how their students think. At one end of the spectrum, instructors can talk to students as part of the course: through questions in class, in after-class chats, during office hours, and in oral exams. These conversations are intended to serve the course and the students, but also provide instructors with glimpses into student thinking. At the other end of the spectrum, there is a range of tools for more detailed research insight into student thinking, including concept inventories and several varieties of cognitive interviews. Concept inventories are written assessments designed to cover specific concepts. Several have been designed for introductory statistics course topics, such as p-values and statistical significance (Lane-Getaz, 2007), inferential reasoning (Park, 2012), and statistical literacy (Ziegler, 2014; Sabbag, 2016). Using a pre-existing concept inventory to assess student thinking has a low time cost for the instructor, since items have already been written and validated. They can also be administered en masse to students since they are typically multiple choice and can be auto-graded. However, they offer low customizability for an instructor or researcher interested in a topic not covered, and since no information is recorded on student thinking beyond their answer choice, it is hard to assess the reason behind incorrect answers—unless the questions were specifically written to detect common misconceptions and the test was validated for this purpose, as with certain items on the Reasoning about P-values and Statistical Significance (RPASS) scale (Lane-Getaz, 2007).
Interviews with students provide richer opportunities to cover specific topics and understand student thinking. Note these interviews are distinct from oral examinations; while oral exams may be useful for assessing student understanding (Theobold, 2021), associating grades with think-aloud interviews can inhibit the ability to accurately capture student thought processes, as discussed by Leighton (2013, 2017). It is important in think-alouds to reassure students that the interview is non-evaluative so they are comfortable sharing their thoughts. Informal discussions in class or in office hours can be a less-evaluative way to understand students' thinking, but such conversations are unstructured interventions primarily meant to serve the students in achieving a course's learning objectives, not to carry out structured research into what they did or did not understand before an intervention. In the context of research, as opposed to assigning grades to students, there are several kinds of cognitive interviews used. In one branch of cognitive interviews, the interviewer makes structured interruptions throughout the interview to ask about the volunteer's thought process. Such interviews have been termed task-based interviews (Woodard and Lee, 2021), verbal probing (Willis, 2005), or cognitive laboratory interviews (Leighton, 2017). For instance, a student volunteer completes a course-related task while the interviewer prompts the student with questions about how they chose their answer, requests feedback on the difficulty of the task, or asks leading questions to help guide the student back if they go too far off track.
Similar varieties of cognitive interviews are also widely used in survey instrument design to ensure survey questions are correctly interpreted and measure the intended constructs (Willis, 2005) or in software design to improve the usability of an interface (Nielsen and Landauer, 1993), in which cases the interviewer may explicitly solicit the volunteer's suggestions about how to improve the survey form or the software interface. While such prompting allows the interviewer to request additional details about the volunteer's reasoning or preferences, it tends to cause subjects to report their self-reflections about their reasoning, which may differ from the actual reasoning process they originally used. By contrast, think-alouds are a style of cognitive interview that involves minimal dialogue with the interviewer. Think-aloud interviews focus specifically on interviewee reasoning without any influence from the interviewer. In a think-aloud interview, conducted privately with the interviewee and interviewer (and potentially a designated note-taker; see below), the interviewer asks the subject to perform a task but requests that the subject think aloud [1] while doing so, starting by reading the task aloud and narrating their entire thought process up to the conclusion (Ericsson and Simon, 1998; Leighton, 2017). In contrast to task-based interviews or tutoring sessions, which include dialogues between the interviewer and student, in a think-aloud interview the interviewer neither gives feedback nor offers clarification until the end of the interview, other than reminders to "Please think out loud" if the subject falls silent (Adams and Wieman, 2011).
This provides a better evaluation of the subject's reasoning process; Ericsson and Simon (1998) suggest that when a subject explains their reasoning only after completing a task, this "biased participants to adopt more orderly and rigorous strategies to the problems that were easier to communicate in a coherent fashion, but in turn altered the sequence of thoughts," while "the course of the thought process can be inferred in considerable detail from thinking-aloud protocols." That said, at times it may be useful to begin the interview protocol with a "concurrent" think-aloud first pass through all the tasks, then conclude with a "retrospective" second pass in which the interviewer may probe for more detail about how the interviewee understood the tasks or explicitly request feedback about the wording of questions (Leighton, 2017). Branch (2000) contrasted think-alouds with "Think Afters" and found that such retrospective reports omitted many of the dead ends that her participants had clearly run into—but did provide more detailed rationales for certain steps taken, especially when tasks were so complex or absorbing that think-aloud participants did not manage to express every detail in real time. While our paper focuses on think-alouds for capturing real-time reasoning, our list of uses in Section 2.2 and our summary of study-design best practices in Section 3 also draw on related examples from other cognitive interview types when appropriate.

2.2 Uses for think-aloud interviews

Think-aloud interviews have been used to elicit respondent thinking in a range of fields, including software usability studies (Nørgaard and Hornbæk, 2006) and many areas of education research (Bowen, 1994; Kaczmarczyk et al., 2010; Adams and Wieman, 2011; Karpierz and Wolfman, 2014; Deane et al., 2014; Taylor et al., 2020).
Think-aloud interviews may be useful both for studying general understanding and misconceptions about statistics concepts, and for improvements in teaching at the instructor and department level. Below we describe several potential research uses for think-aloud interviews.

[1] Although we use the term "think aloud" to be consistent with the literature, communication need not be verbal. The key is to use a real-time communication method, so that participants are relying on short-term working memory in narrating while they solve the task, not reflecting on their solution afterwards. For instance, Roberts and Fels (2006) used a "gestural think aloud protocol" in a study with sign language users.

2.2.1 Developing concept inventories

Think-alouds have been widely used to develop concept inventories in several fields, such as biology (Garvin-Doxas and Klymkowsky, 2008; Deane et al., 2014; Newman et al., 2016), chemistry (Wren and Barbera, 2013), physics (McGinness and Savage, 2016), and computer science (Karpierz and Wolfman, 2014; Porter et al., 2019). In statistics, several concept inventories have used think-aloud interviews or similar cognitive interviews in the development process, including the Reasoning about P-values and Statistical Significance instrument (Lane-Getaz, 2007), the Assessment of Inferential Reasoning in Statistics (Park, 2012), the Basic Literacy in Statistics instrument (Ziegler, 2014), and the Reasoning and Literacy Instrument (Sabbag, 2016). Cognitive interview protocols for the former two instruments allowed for structured verbal probing by the interviewer, such as "What do you think this question is asking?" (Park, 2012), while the latter two instruments reported using strictly think-aloud protocols. In this use, interviews help inventory designers ensure that questions assess the intended concepts and are not misunderstood by students.
Such interviews generally focus on changes to the question wording, not changes to the concept being tested. For instance, Lane-Getaz (2007) describes how interviews prompted a change in one question from the phrase "the experiment has gone awry," which several students did not understand, to the clearer "there was a calculation error," which helped students to focus on the statistical concept behind the question. Apart from question wording, think-aloud verbal reports can serve as "response process validity evidence" (Sabbag, 2016), showing that respondents answer questions by using the intended response process (statistical reasoning) and not some generic test-taking strategy. This evidence can supplement other evidence for the validity of the concept inventory, including the test content, internal structure, and other types of validity evidence not directly addressed by think-alouds (Jorion et al., 2015; Bandalos, 2018, Chapter 11). Unfortunately, details on the think-aloud protocols for past statistics concept inventories are largely recorded in unpublished dissertations. We believe such details are important enough to deserve a prominent place in the published literature, reaching a wider audience. As Leighton (2021) states, "The conditions for interviews [...] actually contain critical information about the quality of collected data. [...] A fairly straightforward way to enhance the collection of verbal reports is to simply include much more information about all aspects of the procedures [...
] This would include comprehensive descriptions of the instructions given to participants, procedures for the timing of tasks, probes and strategies used to mitigate reactivity in the response processes measured." Furthermore, although each of these dissertations summarizes its own chosen think-aloud protocol, we are not aware of detailed discussion in the statistics education literature about general best practices for think-aloud methods or about comparisons between different approaches.

2.2.2 Studying expert practice

When teaching a skill that requires expertise and experience, it may be helpful to conduct interviews with experts to understand the specific skills students need to learn. Experts often are not aware of the exact strategies they use to solve problems, and making their approaches explicit can help develop instructional materials that better teach students to reason like experts (Feldon, 2007). For example, Lovett (2001) used think-alouds with statistics instructors to determine which skills they used to solve each problem. Members of our research group are currently applying the same idea to study expert and student reasoning about probability models (Meyer et al., 2020).

2.2.3 Studying student knowledge and misconceptions

Data from think-alouds may help to characterize how students think about a particular topic and identify misconceptions or misguided problem-solving strategies they may have. While other structured or unstructured interviews have been used for this purpose much more often in statistics education, think-alouds have appeared in the literature a few times. For example, Lovett (2001) used think-alouds in a data analysis activity to explore how students analyze data. Konold (1989) explored student reasoning about probability primarily with think-alouds, but also reported using a few unplanned verbal probes.
Williams (1999) explored students' understanding of statistical significance with think-alouds followed by retrospective semistructured interviews. In Section 4 below, we describe how think-aloud interviews allowed us to discover student misconceptions about sampling distributions and histograms of which we were previously unaware. While the examples above from the statistics education literature tend to focus on qualitative interpretation of verbal reports, such verbal reports could also be carefully coded and used for quantitative analysis of the data. Leighton (2013) studied how a think-aloud interviewer's portrayal of their own mathematical expertise, interacting with prior student achievement and item difficulty, can account for variability in the sophistication of students' response processes on think-alouds about high school math problems.

2.2.4 Improving course materials

In Section 4 below, we describe how think-alouds revealed that some of our questions were misaligned with the intended concept, and that some questions were confusing even if students understood the material. As the questions we used in interviews were often taken from our own course materials, think-alouds allowed us to improve these materials for future students, and incorporate common confusions more directly into teaching material. This is similar in principle to using think-alouds for studying software usability, as was done by Nørgaard and Hornbæk (2006).

2.2.5 Informing course design

When asking students questions about correlation and causation, we found that those we interviewed were often confused about when causal conclusions could be drawn, and sometimes believed confounding variables could still be present even in randomized trials.
These interviews, described in Section 4, along with recent papers on teaching causation (Cummiskey et al., 2020; Lübke et al., 2020), inspired us to explore new labs and activities for teaching correlation, causation, and experimental design. While this work is still in progress, some information can be found in our eCOTS presentation (Evans et al., 2020). It is important to note that reasonable think-aloud protocols may differ between different use cases. For example, developing concept inventories likely requires sufficient sample sizes to reliably assess validity, confidence that the interviewed students are representative, and careful transcription and coding of student responses. However, a smaller study may be adequate if the goal is to improve one's own courses, rather than to generalize to a broad population. In Section 3, we discuss such study design considerations and how they depend on the research goals.

3 The think-aloud process

To investigate student understanding in our introductory statistics courses, we conducted a think-aloud study across several semesters in 2018–2019. In this section, we summarize the main steps and general best practices in the think-aloud process (Leighton, 2017, Chapters 2 and 4). These steps are presented in chronological order and can apply to any think-aloud study. However, specific details may change in different studies, such as the length of interviews, and choices made about questions, records, and compensation. To illustrate this, we use our own think-aloud protocol as a running example throughout this section, showing how these general best practices can be used to guide the design of an individual study.

3.1 Prepare research plan and resources

When think-aloud interviews are conducted for research, they are considered human subjects research.
In the United States, they may be considered exempt from full Institutional Review Board (IRB) review, but this depends on the exact circumstances and institutional policies. After developing a research plan based on the following steps, but before carrying out the research, check in with your local IRB. You may also wish to ensure you have funding available for recruitment incentives, recording interviews, and transcribing recordings, as well as available team members or support staff to carry out interviews and plan other logistics (such as scheduling interviews, acquiring incentives, and keeping track of consent forms or recordings). To ensure quality and consistency, interviewers may need to be formally trained by the research team, or a single interviewer may conduct all interviews. In our case, our interview protocol was reviewed and classified as Exempt by the Carnegie Mellon University Institutional Review Board (STUDY2017_00000581). As discussed below, we decided that recordings and transcription were not necessary for our purposes. Our department's administrative staff were able to provide logistical support, and research team members were able to conduct interviews and take notes. All interviewers and note-takers were faculty or PhD students at Carnegie Mellon at the time of the interviews and collaborated on developing the interview protocol.

3.2 Choose interview questions

Interview questions or tasks depend on the goal of the interview process. For example, when developing a concept inventory, the interview questions should consist of the draft inventory items, each designed to target learning objectives and expected problem-solving approaches or misconceptions, usually based on a review of the literature or on the experience of expert instructors.
For a concept inventory or a study on a specific misconception, we recommend picking a narrow set of tasks to engage in deeply, not a broad array of topics. In some cases, it may also be useful to begin with a round of open-ended questions, and use student answers to construct distractor answers for multiple-choice questions. When choosing interview questions, it is important that they require actual problem-solving skills (rather than simple memorization), and that they are not too easy or hard for the target population of interviewees. Otherwise, the think-aloud process can fail to actually capture steps in reasoning. Park (2012) used a preliminary think-aloud interview with an expert to ensure that the developer's intended response process for each question was indeed the one used by the expert, before continuing on to study students' response processes. In our case, our purpose in conducting think-aloud interviews was to explore student understanding in introductory statistics at our institution, and to investigate whether our beliefs about student misconceptions were correct. We therefore drafted questions about important introductory topics such as sampling distributions, correlation, and causation. We drafted multiple-choice rather than open-answer questions because we generally wanted to check for specific misconceptions that we expected from past experiences with our own students. In Section 4, we describe a small selection of the questions we asked during interviews, to illustrate our reasoning for drafting these questions and how they related to specific misconceptions we had in mind.

3.3 Recruit subjects

3.3.1 Recruitment process

Once interview questions are ready, students are invited to participate.
In line with the principle that questions should neither be too easy nor too hard, and as discussed by Pressley and Afflerbach (1995), the target population of subjects is often those who are still learning the material, although this may vary depending on the research goals. When developing assessment items, interviews with students who have never seen the material before could ensure that the questions are not too easy. In other situations, interviewees could include former students from past semesters if the goal is to understand how well they retain fundamental skills over time. However, a researcher ought to avoid recruiting from their own course: to best capture thought processes, the subject must not feel that they are being judged or evaluated, particularly by an interviewer in a position of power (Leighton, 2013). Human subjects research ethics also requires that subjects not feel pressured into participating. We therefore recommend the interviewer be separate from the course, and that the course instructor play no role in interviewing or recruitment besides allowing a separate recruiter to contact students in the course. Even if the course instructor is not involved in interviews, students may feel pressure to participate when the instructor is a member of the broader research team. In an attempt to minimize this pressure, recruiters should emphasize that no identifying information about interviewees will be shared with any course instructor, and that participation will have no impact on their grade in the course. These reassurances should be repeated at the beginning of each interview, as discussed below. In our case, our research team consisted of PhD students and faculty, and most team members had experience teaching introductory statistics or working with introductory students.
Introductory statistics students were recruited by a member of the research team not involved in course instruction. In our first semester of interviews (Spring 2018), recruitment took place at three points in the semester, chosen to align questions given in the think-aloud interviews to recent course material. The later two semesters were on compressed summer timelines, so recruitment took place only once per class. A sample recruitment script is included in our supplementary materials. Students were offered $20 to participate and signed up with a member of the research team not involved in conducting interviews or course instruction. Every student who volunteered to participate was interviewed, including some repeat participants over the first semester of interviews. Each participant was assigned a pseudonym, which was used throughout the interview recording and data collection process. In total, 31 students participated across three terms, resulting in 42 hour-long think-aloud interviews (33 interviews with 22 students in Spring 2018; three in Summer 2018; and six in Summer 2019). In Section 4, we focus on case studies for a subset of five questions, which were answered by 24 different students. All interviews were conducted by those members of the research group who were not teaching an introductory course at the time of the interviews. These research group members took turns to interview every volunteer and take notes during interviews, following common interview and note-taking protocols (see below).

3.3.2 Sample size and composition

The number and characteristics of subjects to be recruited depends on the research goals and the target population.
For research on misconceptions, the data analysis plan may involve coding the interview responses and carrying out statistical inference about how often a given misconception occurs, in which case power analysis may be used to choose the sample size. On the other hand, for validating a concept inventory, a survey questionnaire, or a software product, the purpose of think-alouds is not to estimate a proportion but to find as many as possible of the potential problems in the question wording or the software's usability. For such problem identification studies, especially if budget or time constraints require a small sample size, purposive sampling or targeted recruitment is often seen as appropriate in lieu of random sampling. Researchers may wish to ensure that the interviewees are representative of the target population by including both more-prepared and less-prepared students; different demographic or language groups; or different academic majors or programs. Park (2012) administered an early pilot of their concept inventory to a class and used the results to recruit students with diverse scores for cognitive interviews. In some cases researchers may also want to compare interviewees from courses with different pedagogical approaches, for instance using traditional vs. simulation-based inference, and could keep a record of the textbooks used for each course. For pretests of survey questionnaires, Blair and Conrad (2011) call for larger samples than typical in past practice. In their empirical study on cognitive interviews for a 60-item survey questionnaire, using a total of 5 to 20 interviews would have uncovered only about a quarter to a half of all the wording problems found by using 90 interviews.
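These two sample-size considerations can be sketched with simple calculations: a normal-approximation sample size for estimating how often a misconception occurs, and the binomial problem-discovery model common in the usability literature, in which each wording problem is assumed to be detected independently in each interview with some fixed probability. The detection probability used below (0.03) is an illustrative assumption, not an estimate from Blair and Conrad's data; this is a minimal sketch of the two models, not a reanalysis of any cited study.

```python
import math

def n_for_proportion(margin, p=0.5, z=1.96):
    """Sample size so a 95% CI for a proportion has half-width <= margin.

    Uses the normal approximation n = z^2 * p * (1 - p) / margin^2,
    with p = 0.5 as the worst case when the true rate is unknown.
    """
    return math.ceil(z**2 * p * (1 - p) / margin**2)

def expected_fraction_found(n_interviews, p_detect):
    """Expected fraction of problems found across n interviews under the
    binomial problem-discovery model: 1 - (1 - p)^n."""
    return 1 - (1 - p_detect) ** n_interviews

# Estimating a misconception rate to within +/- 10 percentage points
# already requires on the order of a hundred interviews:
print(n_for_proportion(0.10))  # 97

# With an assumed per-interview detection probability of 0.03 for each
# wording problem, 10 interviews are expected to surface about a quarter
# of the problems, while 90 interviews surface most of them:
print(round(expected_fraction_found(10, 0.03), 2))  # 0.26
print(round(expected_fraction_found(90, 0.03), 2))  # 0.94
```

The discovery model assumes every problem is equally detectable in every interview, which understates the difficulty of finding rare or group-specific problems, so its output is best read as order-of-magnitude guidance rather than a precise target.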
When improving software usability in an iterative design process, Nielsen and Landauer (1993) argue for conducting 4–5 think-aloud interviews, using the results to revise the product, and repeating the process many times to identify additional issues. In addition, if the study involves so many questions or tasks that not every interviewee can complete them all, sample sizes should be chosen to ensure adequate coverage per task (see below). Past statistical concept inventories have reported using small sample sizes and few rounds of question revision: Lane-Getaz (2007) used two rounds with five and eight students respectively; Park (2012) used two rounds with three and six students respectively; Ziegler (2014) used one round with six students; and Sabbag (2016) used one round with four students. To ensure confidence that most wording problems can be detected, we encourage future developers of statistics concept inventories to conduct more rounds of wording revisions for a larger total number of interviews. Finally, certain question problems might be more easily detected in some demographic groups than others (Blair and Conrad, 2011). We encourage inventory developers either to conduct think-alouds with students from a wide range of educational institutions, or to clearly designate a restricted target population for their assessment instrument based on who participated in think-alouds. In our case, our goals were exploratory rather than inferential, so we simply interviewed all students who volunteered (22 students in the first semester, three in the next, and six in the last). Because students volunteered to participate, our sample may not be representative of all students who take our introductory courses, though our informal sense was that our interviewees were roughly representative of the demographics of this population. We did not record our students' demographics, native language, or major.
The introductory statistics course is a requirement for first-year students in the college where our department is located, and students have until the second semester of their sophomore year to declare a major, so many of our students had not yet declared a major. However, this information would be crucial to record and report in studies that wish to generalize beyond the local student population.

3.4 Conduct interviews

3.4.1 Welcome and introduction

It is important for the subject to feel comfortable during the interview process. As in the recruitment process, power dynamics between the interviewer and interviewee are an important consideration. In ideal circumstances, interviews would be conducted by a non-expert in the subject material to minimize the expert-novice power differential; this is more feasible for think-alouds than for other approaches, such as verbal probing, where the interviewer might need domain expertise. Regardless, the recruiting script and introductory script should focus on making students as comfortable as possible with the think-aloud process, and interviewers should attempt to present themselves as non-judgmental and supportive throughout the interview process. The interviewer should begin by welcoming the student; introducing themselves (and the note-taker, if present; see below); and optionally offering the student a bottle of water. At the beginning of the interview, the interviewer explains the interview process and its purpose. As in the recruitment step, it is important to reassure the student that their answers will have no impact on their grade in the course, and that the purpose of the interview is to assess the course, instructor, and/or assessment material, not the student. A sample introduction script can be found in the Supplementary Materials; it is similar to the example language in Table 2.1 of Leighton (2017).
Subjects will also likely need to sign a consent form agreeing to participate in research. In our case, we did not use non-expert interviewers, as all team members were experts in introductory statistics. Furthermore, as interviews were conducted verbally and in English, non-native speakers may have been less likely to volunteer, or more cautious when voicing their thoughts. Finally, our interviewers were mostly male and/or white, which again could have impacted which students volunteered or how comfortable they felt thinking aloud. We attempted to mitigate these concerns through the language in our recruiting and introductory scripts, and through our interviewers' non-judgmental approach to the interview process. Our introductory script also emphasized that our purpose in investigating student understanding of introductory topics was ultimately to improve our courses, not to evaluate the student.

3.4.2 Warm-up

Thinking aloud can be challenging, and most subjects don't have experience with this skill. To introduce the idea of thinking aloud, Leighton (2017) and Liu and Li (2015) recommend a warm-up activity in which the interviewee thinks aloud with a practice problem. Without such practice, students may try to problem-solve first and then justify conclusions out loud afterwards, instead of narrating all along. This warm-up should be accessible even without statistical knowledge or, for that matter, cultural knowledge. In our case, for example, a warm-up used in our interviews was asking the student to describe the steps involved in making their favorite kind of toast. This replaced an initial warm-up activity of discussing a data visualization about an American actor, which turned out to be unnecessarily challenging for novice statistics students as well as for students unfamiliar with US television shows.
3.4.3 Interview questions

Subjects are given each question in turn, and asked to think aloud while answering. The interviewer does not interrupt, except to remind the interviewee to think aloud if needed. The number of interview questions answered by a subject will depend on the length of the questions and the subject's skills. For development of a concept inventory, we recommend varying the question order systematically to ensure equal coverage for all questions. For an exploratory study like ours, question order may be varied to prioritize questions that seem to be provoking rich responses. For a formal study of particular misconceptions, we recommend simply choosing few questions overall, so that every interviewee is likely to complete all tasks. In our case, our think-aloud interview sessions were structured to include ten minutes for introduction and instructions; about thirty minutes for students to solve questions while thinking aloud uninterrupted; and a twenty-minute period at the end for the interviewer to review the questions with the student, with follow-up discussion to clarify the student's reasoning as needed, and finally explaining the answers to the student if they should ask. Our students answered between 6 and 38 questions in the thirty-minute question period, with most students answering about 20. As we drafted more questions than one student could answer in the allotted time, we varied the order in which questions were asked for different students, prioritizing the questions that seemed to be turning up the most interesting responses. As a result, we recorded between 1 and 14 responses for each interview question within each round of interviews, with a mean of 5.4.
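For the systematic ordering recommended above for concept-inventory development, one simple scheme (our illustration, not a procedure from the study) is a cyclic rotation, so that every question appears near the start of the session for some interviewees and no question is consistently starved of responses:

```python
# Sketch: balanced (cyclic) rotation of question order across interviewees,
# so every question leads the session for an equal share of subjects.
# This is an illustration, not the ordering scheme used in the study above.
def rotated_orders(questions, n_interviewees):
    """Return one question ordering per interviewee, rotating the start point."""
    k = len(questions)
    return [questions[i % k:] + questions[:i % k] for i in range(n_interviewees)]

orders = rotated_orders(["Q1", "Q2", "Q3", "Q4"], 4)
# With 4 interviewees and 4 questions, each question leads exactly one session.
```

More elaborate designs (e.g., Latin squares balancing both position and carryover) are possible, but even a rotation avoids the coverage imbalance that a fixed order produces when sessions end early.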
3.4.4 Interview records

While the subject thinks aloud, the interviewer or a second designated note-taker may take notes, including quotes, interesting methods used, and any part of the task the subject found confusing. Alternatively, the interview may be video- or audio-recorded for future analysis. For exploratory think-alouds, note-taking may be sufficient to identify broad themes in interviewee responses, and the time cost of transcribing and coding recorded interviews is likely prohibitive. Other research contexts may require careful assessment of each interview (such as detailed coding to count how often particular response strategies were used, or extended quotes to show rich details of interviewee thinking), in which case recording is preferred. If recordings are made, your IRB application will need to explain how you will protect the anonymity and confidentiality of these recordings. If students use scratch paper while working out their answers, this should also be kept as part of the data for possible analysis. In our case, our interviews were conducted with one designated interviewer, who sat next to the student and asked questions, and one designated note-taker, who sat at the other end of the room and took notes during the interview process. Both interviewer and note-taker were research group members. Although we did not record interviews, after the first several think-alouds our research team developed a coding structure to help note-takers flag points of interest in real time. For instance, our coding noted when students misunderstood the question or used non-statistical reasoning (question wording or subject-matter knowledge) to reach an answer, which helped us flag items that needed to be revised before they could be useful for studying statistical knowledge. Our coding scheme is summarized in the supplementary materials.
3.4.5 Debrief (student)

To allow the interviewer to ask clarifying questions, time should be allotted for a twenty-minute debrief at the end of each interview. Importantly, this also provides an opportunity for the student to ask any questions, and for the interviewer to help the student understand the material better. Leighton (2017) terms this a "retrospective" portion of the interview, in contrast to the "concurrent" think-aloud portion above. If a note-taker is used, they should clearly delineate which notes come from the concurrent vs. retrospective portions. In our case, we allowed twenty minutes for the debrief.

3.4.6 Compensation

If possible, interviewees should be compensated for participation in the research process. In our case, students were given a $20 Amazon gift card at the end of the interview.

3.4.7 Debrief (interviewer and note-taker)

After the interviewee leaves, the interviewer (and note-taker, if present) should take a moment to note any important observations that they did not manage to record during the interview itself. In our case, the interviewer and note-taker debriefed together. This step typically took around five to ten minutes.

3.5 Analyze results

If recordings were made, it is generally useful to transcribe the interviews, then code them to show where and how often certain responses occurred. For instance, in an exploratory study on misconceptions or data-analysis practices, initial review of the think-alouds might lead to a tabulation of all the strategies that different interviewees used for a task. Each of these strategies might then become a code, and the analysis might involve reflecting on the frequency of each code by task or by sub-groups of interviewees.
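The tabulation step described above, turning observed strategies into codes and counting them by task, might look like the following sketch; the task names, code labels, and records are hypothetical, invented for illustration:

```python
# Sketch of tabulating strategy codes by task from coded interview notes.
# The tasks, codes, and records below are hypothetical examples.
from collections import Counter
from itertools import groupby

# (task, code) pairs, one per coded observation
coded_notes = [
    ("farm-areas", "expects-normal-pop"), ("farm-areas", "correct-shape"),
    ("farm-areas", "expects-normal-pop"), ("study-time", "counts-bars"),
    ("study-time", "uses-variance"), ("study-time", "counts-bars"),
]

# groupby requires sorted input; the result maps each task to code frequencies
by_task = {
    task: Counter(code for _, code in group)
    for task, group in groupby(sorted(coded_notes), key=lambda rec: rec[0])
}
```

From here, the counts can be compared across tasks or interviewee sub-groups, as the text suggests.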
Meanwhile, for a confirmatory study, the codes should be determined in advance, such as by experts determining a cognitive model of the response processes they expect students to use, along with a rubric for deciding which utterances could count as evidence for or against the model; responses coded by this rubric can be analyzed to determine how well actual student behavior matched the experts' model (Leighton, 2017, Chapter 4). In both cases, most codes will probably need to be task-specific. However, for developing a concept inventory or a survey, some codes might be reused across tasks, relating to how the items themselves could be improved (e.g., confusing wording, excessive length, or being answerable without statistical knowledge) as well as whether the interviewee's response showed signs of specific expected misconceptions. For instance, Park (2012) coded each response by whether students got the right or wrong answer and also whether they used correct or incorrect reasoning, then reported how often each question had "matching" answers (either right and with correct reasoning, or wrong and with incorrect reasoning, but not vice versa). Extended quotes from the coded transcription can provide detail on exactly what stumbling blocks arose, and may help suggest how to revise the item. In concept inventory writeups, the developers often report each item's original wording, relevant quotes from each interviewee, and consequent changes to the item. If the original item was presented as open-ended, any incorrect responses may be used to develop multiple-choice distractor answers. To guard against idiosyncratic coding, at least two raters should code several reports using the same coding scheme so that inter-rater reliability can be evaluated. If necessary, rating discrepancies can be reconciled through discussion and the coding scheme can be improved.
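The text does not prescribe a particular inter-rater reliability statistic; Cohen's kappa for two raters is one common choice, sketched here with invented codes:

```python
# One common way to quantify inter-rater reliability for two raters is
# Cohen's kappa, which measures agreement beyond what chance would produce.
# The statistic choice and the example codes are illustrative assumptions,
# not taken from the study described above.
from collections import Counter

def cohens_kappa(codes_a, codes_b):
    """Cohen's kappa for two raters: 1 = perfect agreement, 0 = chance level."""
    assert len(codes_a) == len(codes_b)
    n = len(codes_a)
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    labels = set(freq_a) | set(freq_b)
    # Chance agreement from each rater's marginal code frequencies
    expected = sum(freq_a[c] * freq_b[c] for c in labels) / n ** 2
    return (observed - expected) / (1 - expected)

rater1 = ["wording", "misread", "wording", "other", "misread", "wording"]
rater2 = ["wording", "misread", "other",   "other", "misread", "wording"]
kappa = cohens_kappa(rater1, rater2)
```

Low kappa values would signal that the coding scheme needs the discussion and refinement step mentioned above.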
As discussed above, Nielsen and Landauer (1993) recommend frequent iteration cycles of 4–5 interviews followed by revisions. Unless the interview tasks have been extensively pretested already, we suggest planning from the start for at least two cycles of think-alouds, and possibly many more if the goal is to detect and fix problems with an instrument. The first cycle is likely to find at least some of the most severe problems with the initial tasks or the interview protocol itself; a second cycle at minimum allows researchers to check whether changes to question wording or protocol introduced any new issues. If multiple revision cycles were used, researchers ought to report how they decided when to stop revising. In our case, our research team met weekly during the first semester of interviews (Spring 2018) and once each during the next two semesters of interviews (Summers 2018 and 2019) to discuss interim results and to propose question revisions or new items. We planned to iterate over one or two cycles of small revisions per semester, although for ease of exposition our case studies in Section 4 focus on scenarios with one major revision each. Furthermore, as we did not anticipate generalizing our results beyond our local student body, we did not plan for recording, transcription, and detailed coding. We found that our note-takers could record the most interesting qualitative takeaways from each interview in adequate detail for our purposes, though such notes may not have been sufficiently detailed or reliable for other research goals. We provide several examples in Section 4.

4 Case studies

In Section 3, we described the general think-aloud process and the specific details of our think-aloud interviews with students in introductory statistics courses.
Our goal in conducting these interviews was to explore misconceptions among introductory students at our university, and we compiled questions to target misconceptions we had encountered through interactions with students in class, office hours, and assignments. In this section, we describe our experiences with think-aloud interviews for several questions. We focus on five questions in which students produced unexpected answers that revealed misconceptions of which we were previously unaware, and which motivated us to reconsider how we taught these topics. We also take the opportunity to show how an early round of think-alouds can lead to revisions that make the tasks more effective in later think-alouds. These five questions were tested in think-aloud interviews across 24 different students. We will use numbers 1–17 to denote the students directly quoted or paraphrased in this paper.

4.1 Sampling distributions and histograms

Understanding variability and sampling distributions is an important part of the GAISE College Report guidelines (GAISE College Report ASA Revision Committee, 2016), but we have noticed that students often struggle with these concepts. The introductory statistics course at Carnegie Mellon devotes substantial time to sampling distributions, showing students through simulation-based activities how the shape and variance of the sampling distribution of the mean change as we change the sample size. These activities include sampling from different population distributions, to demonstrate how the central limit theorem applies even when the original distribution is decidedly non-normal. However, in our experience students often struggle to understand the idea that the variance of the sample mean decreases as sample size increases.
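The kind of simulation-based activity described above can be sketched in a few lines; the exponential population and the repetition counts are illustrative choices, not taken from the course materials:

```python
# Sketch of a sampling-distribution simulation: draw many samples from a
# skewed (exponential) population and watch the spread of the sample mean
# shrink as n grows. The population and counts are illustrative assumptions.
import random
import statistics

random.seed(0)

def population_draw():
    return random.expovariate(1.0)  # skewed population with mean 1

def sample_means(n, reps=2000):
    """Means of `reps` samples of size n drawn from the population."""
    return [statistics.fmean(population_draw() for _ in range(n))
            for _ in range(reps)]

sd_n5 = statistics.stdev(sample_means(5))
sd_n50 = statistics.stdev(sample_means(50))
# Theory predicts sd ~ 1/sqrt(n): the n = 50 means cluster much more tightly
# around 1 than the n = 5 means, and their histogram is also more symmetric.
```

Plotting histograms of `sample_means(5)` and `sample_means(50)` side by side reproduces the visual comparison the study-time question below asks students to reason about.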
To explore student reasoning about variability within sampling distributions and sample size, we drafted a question in which students had to visually identify a decrease in variance, and then connect this with an increase in sample size. However, think-aloud interviews showed that students misinterpreted the histograms we used to display the sampling distributions, and also revealed potential misconceptions about normality of sampling distributions vs. normality of the population. This inspired us to revise the original question, draft a new question, and conduct further think-alouds to explore these misconceptions.

4.1.1 Original question

Figure 1 shows the study-time question, intended to test understanding of sampling distributions and sample size. We expected that students who did not understand the relationship between sample size and the variance of the sample mean would not know how to choose the correct answer; but they might still get it partially correct if they remembered the approximate normality of the mean's sampling distribution. We were curious to see what other strategies students might use for this problem if they did not recall either of these two concepts. The intended answer was that histogram B is the population distribution, histogram A the sampling distribution of X̄ when 𝑛 = 5, and histogram C the sampling distribution with 𝑛 = 50.

4.1.2 Student responses

To our surprise, all nine students who answered this question during think-aloud interviews got it wrong, claiming that the sampling distribution of the mean with 𝑛 = 5 should be graph C in Figure 1. No students appeared to use the idea of normality of sampling distributions in their reasoning for this question, and only one student noted that variance should decrease with increasing sample size in a sampling distribution (Student 1). No others indicated paying attention to variability.
Three students confused the sample size with the number of bars in the histogram, with one student commenting that "small 𝑛 means few bars" (Student 2) and then concluding that a sampling distribution with 𝑛 = 5 should have the fewest bars (graph C). Another student admitted, in the retrospective portion of the interview, to not having thought about the sample average at all, just the distribution of the sample (Student 3). This suggested the question was not capturing the reasoning it was intended to capture: students were selecting histograms by matching 𝑛 to the number of bars, not necessarily by reasoning about the variance of the mean of samples of varying sizes. This is related to a previously-studied misconception, of which we were unaware, that students mistake the bar heights in a histogram for the observed values in a dataset, and the number of bars for the number of observations (Kaplan et al., 2014; Boels et al., 2019).

[Figure 1 shows three histograms, labeled A, B, and C, each with Hours on the x-axis and Frequency on the y-axis.]

Figure 1: (study-time, original version) To estimate the average number of daily hours that students study at a large public college, a researcher randomly samples some students, then calculates the average number of daily study hours for the sample. Pictured (in scrambled order) are three histograms: one of them represents the population distribution of number of hours studied; the other two are sampling distributions of the average number of hours studied X̄, one for sample size 𝑛 = 5, and one for sample size 𝑛 = 50. Circle the most likely distribution for each description.
• Population distribution: A B C
• Sampling distribution for 𝑛 = 5: A B C
• Sampling distribution for 𝑛 = 50: A B C
Additionally, two of the nine students commented that the population should be normally distributed, and hence selected graph A as the population distribution, arguing that it was the most symmetric (Students 2 and 4). Previous research has also identified students incorrectly thinking that distributions besides sampling distributions should have characteristics of the normal distribution (Noll and Hancock, 2015).

4.1.3 Revision

Based on these think-aloud results, we took two steps to follow up on the misconceptions that were uncovered. First, would students still fail to relate the spread of the sampling distribution to the sample size if they were not misreading the histograms and statistical jargon? We revised the original study-time question to use mechanistic language without mathematical notation, replacing the initial question text with the following description and asking students to match A, B, and C to Jeri, Steve, and Cosma, keeping the figure the same:

Jeri, Steve, and Cosma are conducting surveys of how many hours students study per day at a large public university. Jeri talks to two hundred students, one at a time, and adds each student's answer to her histogram. Steve talks to two hundred groups of 5 students. After asking each group of 5 students how much they study, Steve takes the group's average and adds it to his histogram. Cosma talks to two hundred groups of 50 students. After asking each group of 50 students how much they study, Cosma takes the group's average and adds it to his histogram. The three final histograms are shown below, in scrambled order.

Because the number of points in each histogram, two hundred, was explicitly stated in each case, we hoped that students would no longer answer incorrectly due to misreading the histograms.
This version also does not use the term "sampling distribution", so it tests whether students recognize the concept without seeing the term. Second, we also drafted a new question to further explore the potential misconception that populations are always normally distributed. Would students still have this misconception when we are not directly asking about the tricky topic of sampling distributions? The farm-areas question, shown in Figure 2, describes a situation in which the entire population is surveyed and a histogram of the results prepared, along with histograms of samples (not sampling distributions) of sizes 𝑛 = 20 and 𝑛 = 1000. Three possible sets of histograms are provided, and students are asked to select the most plausible set based on their shapes. The intended answer, (A), shows a skewed population distribution and two skewed samples. The first distractor, (B), is meant to test whether students are willing to believe that a population could be normally distributed even if a large sample has a skewed distribution. The second distractor, (C), was included to test the opposite misconception: that the distribution of a sample would appear normal even if the population does not. We expected students to choose answer (C) if they confused the distribution of a sample with the sampling distribution (Lipson, 2002; Chance et al., 2004; Castro Sotos et al., 2007; Kaplan et al., 2014).

4.1.4 More student responses

In twelve new think-aloud interviews on the revised study-time question, nine students answered correctly. However, three of those nine still confused the number of bars with the sample size, as did one student who answered incorrectly. These four students misread the text and thought that there were 200 students total, so that Cosma had four groups of 50.
When combined with the histogram-bars-as-data-points misconception, they correctly matched Cosma to graph C despite making two major mistakes in reasoning. Another student who answered incorrectly did use correct reasoning about the normality, but not the spread, of sampling distributions; they wrongly matched Cosma's larger groups of students with graph A because it looked more normal (Student 5). Of the remaining correct answers, five students referenced the normality or spread of the distribution of means, saying things like "taking the average of a larger group should lead to the means being all bunched up in one place" (Student 6). In short, more students did appear to use some of the intended reasoning in answering this question than in its original version, although this question would benefit from further rounds of revision. As with other misinterpretations of histograms that have been previously reported in the literature (Kaplan et al., 2014; Cooper, 2018; Cooper and Shore, 2008), students continued to misinterpret the meaning of histogram bars. Ten students answered the farm-areas question during think-aloud interviews, of whom only four selected the intended answer. The remaining six split evenly between the two distractor answers, reinforcing the notion that some of our students do hold misconceptions about normality of populations and about samples vs. sampling distributions. Among those selecting the first distractor (row B in Figure 2), one explained that with a larger sample size, "there is less of a chance for data to vary" (Student 7), and that this distractor had the most "centralized" population distribution. In the retrospective portion of the interview, the student confirmed that this meant they had been expecting to see a symmetric population distribution.
Among students selecting the second distractor (row C in Figure 2), one noted, "I'm assuming it's looking for a normal distribution, the greater the sample size" (Student 8), and indicated that this choice had a more normal histogram for 𝑛 = 1000, suggesting that they were indeed looking for the normality that would be expected if these were sampling distributions rather than samples.

[Figure 2 shows three rows of histograms, labeled A, B, and C. Each row contains three histograms (Population; Sample, n = 1000; Sample, n = 20), each with farm area (sq. km) on the x-axis and Count on the y-axis.]

Figure 2: (farm-areas) Farmer Brown collects data on the land area of farms in the US (in square kilometers). By surveying her farming friends, she collects the area of every farm in the US, and she makes a histogram of the population distribution of US farm areas. She then takes two random samples from the population, of sizes 𝑛 = 1000 and 𝑛 = 20, and plots histograms of the values in each sample. One of the rows below shows her three histograms. Using the shape of the histograms, choose the correct row.

4.1.5 Discussion

In this case, think-aloud interviews allowed us to identify misconceptions we were unaware of, and to draft some new materials to further explore these misconceptions. These exploratory results, however, do not by themselves explain why students hold these misconceptions, and it is unclear whether the misunderstandings arise due to the way histograms and sampling distributions are presented in our statistics courses. Further research could use think-alouds as one tool to explore how students think about sampling, perhaps in conjunction with specific teaching interventions.
In the short term, we have begun to directly address these misconceptions when teaching students about histograms and about the distinctions between populations, samples, and sampling distributions. The questions and graphs presented here are by no means fully polished, and additional think-alouds could be used to further improve and refine them. For instance, the lack of marked x- and y-axis scales in farm-areas may have introduced new confusion, distinct from the histogram-reading difficulties we already uncovered, that should be addressed in future rounds of revisions and new think-alouds. However, even in unpolished form, these questions have proved useful for our purposes of investigating our students' understanding.

4.2 Correlation and causation

The role of random assignment in drawing causal conclusions is emphasized by the GAISE guidelines, under the goal that students should be able to "explain the central role of randomness in designing studies and drawing conclusions" (GAISE College Report ASA Revision Committee, 2016). Our introductory courses have therefore emphasized the difference between randomized experiments and observational studies, and that correlation does not necessarily imply causation. Activities include examples of data analyses in which students critique the language used to discuss causation vs. observation, and identify instances in which causal conclusions have been incorrectly drawn. For think-aloud interviews, we drafted two questions on correlation and causation, based on our own class materials. However, think-aloud interviews suggested that some students were unwilling to ever draw causal conclusions, a misconception we targeted with a new question in a second round of interviews.

4.2.1 Initial questions

In clinical-trial, a randomized experiment supports a causal conclusion, while in books, an observational study does not support a causal claim.
Table 1 shows the initial questions and answer choices. We expected that among our students, who had just begun learning about these topics, the most common mistake would be the one that our courses usually try to prevent: making causal claims where they are not warranted (in the books question).

clinical-trial (original)
A clinical trial randomly assigned subjects to receive either vitamin C or a placebo as a treatment for a cold. The trial found a statistically significant negative correlation between vitamin C dose and the duration of cold symptoms. Which of the following can we conclude?
A. Recovering faster from a cold causes subjects to take more vitamin C.
B. Taking more vitamin C causes subjects to recover faster from a cold.
C. We cannot draw any conclusions because correlation does not imply causation.
D. We cannot draw any conclusions because assignment was random instead of systematic.

books
A survey of Californians found a statistically significant positive correlation between number of books read and nearsightedness. Which of the following can we conclude about Californians?
A. Reading books causes an increased risk of being nearsighted.
B. Being nearsighted causes people to read more books.
C. We cannot determine which factor causes the other, because correlation does not imply causation.
D. We cannot draw any conclusions because Californians aren't a random sample of people.

Table 1: Initial questions, with answer choices, on correlation and causation.

4.2.2 Student responses

For clinical-trial, the intended answer choice (B) is that vitamin C causes faster recovery from colds, because the study described is a randomized experiment. In think-aloud interviews, four of six students answered correctly; however, none of these four students referred to random assignment as they thought aloud. Two students who answered correctly told us they strongly believed that "correlation does not equal causation," but still picked the intended answer because it made sense to them that vitamin C actually would cause subjects to recover faster from a cold (Students 9 and 10). One said you "usually can't assume causation" (Student 9), then picked the causal answer despite hesitating and stating that it is just correlation. While students may get questions (particularly multiple-choice questions) correct for the wrong reason, or just by guessing, in initial think-alouds with clinical-trial we saw students answering the question correctly despite truly believing the opposite conclusion (that "correlation does not equal causation"). Furthermore, of the two who answered incorrectly, both chose answer C, refusing to make causal claims despite the random assignment. One believed that you can only ever talk about significance, not causation (Student 11), while the other stated they did not see any difference between this question and books (Student 12). On the other hand, in think-alouds for books, four of five students chose the intended answer, (C): "We cannot determine which factor causes the other, because correlation does not imply causation." In their responses, students said nothing to indicate they understood when causal conclusions could actually be drawn; one student explicitly stated a misconception that "correlation does not imply causation is a universal rule" (Student 11). The fifth student, who answered incorrectly, tried to use elimination instead of statistical reasoning (Student 13).

4.2.3 Revision

Student responses to books and clinical-trial suggest that our students were generally overcautious about drawing causal conclusions.
They clung to the mantra "correlation is not causation" and based their causal claims not on statistical grounds of study design but on subject-matter plausibility. From this original pair of questions, we could not tell whether confusion about causation arose mostly because they were primed by this mantra, or whether students truly misunderstood the roles of random assignment and confounding variables in making causal claims. We therefore drafted an additional question, font-test (Table 2), which explicitly described a randomized experiment and included distractor answers focusing on confounding variables and random assignment, but using mechanistic language that was intended to avoid the technical terms "correlation" and "causation" (as well as "statistically significant"). The intended answer was (A), while we expected students who misunderstood confounding or study design to select answers (C) or (D). Additionally, we changed the treatment in clinical-trial from vitamin C to mindfulness meditation, so that the treatment's efficacy would seem like more of an open question.

clinical-trial (revised)
A clinical trial randomly assigned subjects to either practice mindfulness meditation or a placebo relaxation exercise as a treatment for a cold. The trial found that subjects who practiced mindfulness meditation had a shorter time to recovery than students assigned to the relaxation exercise, and the result was statistically significant. Which conclusion does this support?
A. Recovering faster from a cold causes subjects to meditate.
B. Mindfulness meditation causes subjects to recover faster from a cold.
C. We cannot draw any conclusions because correlation does not imply causation.
D. We cannot draw any conclusions because assignment was random instead of systematic.

font-test
Professor Smith wants to know if typing her introductory statistics exams in Comic Sans will improve her students' exam performance. To answer this question, she randomly gives half of the 200 students in her class an exam with all of the questions typed in Comic Sans, while the other students get the same exam with questions typed in Times New Roman. After comparing the exam scores across both groups of students, Professor Smith finds that the students who were given the exam typed in Comic Sans had a higher average grade on the exam, compared to the average grade for students who did not receive the exam typed in Comic Sans. Professor Smith repeats this experiment across multiple semesters of her course and always sees the same result. Which of the following is true?
A. The result is statistical evidence that giving students exams typed in Comic Sans will lead to higher exam scores across the class.
B. All teachers in every subject should print their exams in Comic Sans to improve their students' performance.
C. Professor Smith can't draw any conclusions from these tests because other factors, such as the amount of hours students spent studying, might also affect their exam results.
D. Professor Smith can't draw any conclusions from these tests because she randomly decided which students would receive the exam typed in Comic Sans instead of choosing students systematically, such as giving only the female students the exam typed in Comic Sans.

Table 2: Revised and new questions, with answer choices, about correlation and causation.
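The statistical point behind font-test, that random assignment rather than the mere presence of an association is what licenses a causal conclusion, can be made concrete with a short simulation. The sketch below is our own illustration, not material from the study: the confounder (study hours), the effect sizes, and the group-assignment rules are all hypothetical.

```python
import random
import statistics

random.seed(0)
n = 10_000

# Hypothetical confounder: hours each student spent studying (illustrative only).
hours = [random.gauss(10, 2) for _ in range(n)]

def exam_score(hours_studied):
    # Assumed truth for this sketch: the "treatment" (e.g., the exam font) has
    # zero real effect; scores are driven entirely by study hours plus noise.
    return 50 + 3 * hours_studied + random.gauss(0, 5)

# Self-selected design: harder-working students end up in the treatment group.
self_selected = [h > 10 for h in hours]
# Randomized design: a coin flip, independent of study hours.
randomized = [random.random() < 0.5 for _ in range(n)]

def naive_difference(assignment):
    # Naive analysis: difference in mean scores, treated minus control.
    treated = [exam_score(h) for a, h in zip(assignment, hours) if a]
    control = [exam_score(h) for a, h in zip(assignment, hours) if not a]
    return statistics.mean(treated) - statistics.mean(control)

print(f"self-selected design: {naive_difference(self_selected):+.1f} points")
print(f"randomized design:    {naive_difference(randomized):+.1f} points")
```

Because the treatment has no real effect in this setup, the self-selected comparison reports a large spurious difference driven entirely by study hours, while the randomized comparison comes out near zero. This is the reasoning about confounding and random assignment that distractors (C) and (D) are designed to probe.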
We also softened the wording of the question from "Which of the following can we conclude?" to "Which conclusion does this support?"

4.2.4 More student responses

The incorrect responses to font-test (three of six correct) indicated misunderstandings beyond mere reliance on the mantra "correlation is not causation": one student conflated random sampling with random assignment (Student 14), and another thought of possible confounding factors and did not notice the random assignment (Student 15), while the third noticed random assignment but still thought there were confounding variables (Student 16). Of the three correct responses, two showed an understanding that random assignment allows for causal claims, but the third wrongly justified the causal claim with random sampling (Student 17).

For the revised clinical-trial, all five students incorrectly answered that correlation does not imply causation, which reinforced our suspicion that correct answers to the original version stemmed from the plausibility of vitamin C treatment rather than from understanding statistical concepts. One student thought there could still be confounding variables (Student 16), while another seemed to believe that correlation could never imply causation (Student 17), a belief also expressed in response to font-test:

• "When can we ever say something causes something else?" (font-test, Student 14)
• "I think the word 'causes' is too strong... my friend who's a stats major always tells me you can't say this causes that—there's always other factors" (clinical-trial, Student 14)

4.2.5 Discussion

It seems that while our students had learned that correlation does not imply causation, they struggled more with understanding how randomized experiments could provide evidence for causal conclusions. This matches with several studies in the statistics education literature: Pfannkuch et al.
(2015) and Sawilowsky (2004) observed that students believed confounding was possible even in randomized experiments, while Fry (2017) discusses the misconception of "not believing causal claims can be made even though random assignment was used." We hypothesize that an incomplete understanding of confounding variables, and of why randomized experiments prevent confounding, may be part of the confusion with drawing causal conclusions. If this is the case, it may be helpful to include additional causal inference material in the introductory curriculum; inspired by these think-alouds, and by the work of Cummiskey et al. (2020) and Lübke et al. (2020), we have introduced simple causal diagrams into an introductory statistics course. Some preliminary discussion of this new material can be found in Evans et al. (2020). It is also possible that students were confused because our courses have overemphasized observational studies and under-emphasized randomized experiments, or simply because of question wording; as in Section 4.1, the questions presented here could be improved by further think-alouds. In future work, we hope to further explore why our students hesitate with causation.

5 Conclusion

In our experiences with think-aloud interviews, we have seen that think-alouds provide a valuable tool for investigating student understanding of introductory statistics concepts. By conducting interviews with students in our own courses, we learned that we had not adequately anticipated certain misconceptions about histograms, sampling distributions, and correlation and causation. Our findings so far have inspired us to plan for future think-aloud interviews where we will further explore our students' reasoning about study design, data analysis, and statistical inference.
For example, we hope to conduct future interviews in which students conduct or assess data analysis tasks, to see which choices students make (and in what order) when working with data. Many of the steps from Section 3 would be similar, but we would need to carefully choose questions that provide enough structure that they can be completed during an interview, while still allowing students to make different choices.

The way we designed our existing think-aloud study was suited to our particular needs. The real-time nature of think-alouds allowed us to gauge how well students' statistical thinking had become internalized, rather than limited to the more deliberate, self-conscious reflection we would have seen with verbal probing or during office hours. By using a process with more than one iteration (conduct several think-alouds, reflect on student responses, revise questions or draft new ones, and repeat), we were able to adapt quickly and follow up on surprising findings, unlike with a static concept inventory. Finally, as a research group composed of instructors with a common student population, our shared discussions of student responses prompted buy-in to making changes to our own courses, including new material designed to address the misconceptions we were seeing.

Of course, we present this as just one example of implementing think-alouds, and other situations will call for a different approach. We hope that our experiences encourage other statistics education researchers to use think-aloud interviews, whether they are investigating misconceptions, writing questions to assess a single concept, or revising a full concept inventory. Likewise, we hope that our summary of best practices will help others tailor their own think-aloud study designs to their institutional contexts and research problems.
Acknowledgments

We are grateful to the editor, associate editors, and reviewers for their many helpful comments. Thanks to Carnegie Mellon's Eberly Center for Teaching Excellence and Educational Innovation for initial support developing this study; to David Gerritsen for initial advice on conducting think-aloud interviews; and to Gordon Weinberg for feedback and suggestions for questions, and for facilitating administration of the assessment to his courses. We also thank Sangwon Hyun, Ron Yurko, and Kevin Lin for contributing questions and assisting with think-aloud interviews. We are grateful to Christopher Peter Makris for extensive logistical support. Many thanks to our student participants, without whom this research would not have been possible.

Supplemental Materials

The recruiting script, interview protocol, and coding scheme are included as supplemental material.

References

Adams, W. K. and Wieman, C. E. (2011). Development and Validation of Instruments to Measure Learning of Expert-Like Thinking. International Journal of Science Education, 33(9):1289–1312.

Bandalos, D. L. (2018). Measurement Theory and Applications for the Social Sciences. Guilford Press, New York, NY.

Blair, J. and Conrad, F. G. (2011). Sample size for cognitive interview pretesting. Public Opinion Quarterly, 75(4):636–658.

Boels, L., Bakker, A., Van Dooren, W., and Drijvers, P. (2019). Conceptual difficulties when interpreting histograms: A review. Educational Research Review, 28:100291.

Bowen, C. W. (1994). Think-Aloud Methods in Chemistry Education: Understanding Student Thinking. Journal of Chemical Education, 71(3):184.

Branch, J. L. (2000). Investigating the information-seeking processes of adolescents: The value of using think alouds and think afters. Library & Information Science Research, 22(4):371–392.

Castro Sotos, A.
E., Vanhoof, S., Van den Noortgate, W., and Onghena, P. (2007). Students' misconceptions of statistical inference: A review of the empirical evidence from research on statistics education. Educational Research Review, 2(2):98–113.

Chance, B., delMas, R., and Garfield, J. (2004). Reasoning about sampling distributions. In Ben-Zvi, D. and Garfield, J., editors, The Challenge of Developing Statistical Literacy, Reasoning and Thinking, chapter 13, pages 295–323. Kluwer Academic Publishers.

Cooper, L. L. (2018). Assessing students' understanding of variability in graphical representations that share the common attribute of bars. Journal of Statistics Education, 26(2):110–124.

Cooper, L. L. and Shore, F. S. (2008). Students' misconceptions in interpreting center and variability of data represented via histograms and stem-and-leaf plots. Journal of Statistics Education, 16(2).

Cummiskey, K., Adams, B., Pleuss, J., Turner, D., Clark, N., and Watts, K. (2020). Causal inference in introductory statistics courses. Journal of Statistics Education, 28(1):2–8.

Deane, T., Nomme, K., Jeffery, E., Pollock, C., and Birol, G. (2014). Development of the Biological Experimental Design Concept Inventory (BEDCI). CBE—Life Sciences Education, 13(3):540–551.

Ericsson, K. A. and Simon, H. A. (1998). How to Study Thinking in Everyday Life: Contrasting Think-Aloud Protocols With Descriptions and Explanations of Thinking. Mind, Culture, and Activity, 5(3):178–186.

Evans, C., Reinhart, A., Burckhardt, P., Nugent, R., and Weinberg, G. (2020). Exploring how students reason about correlation and causation. Poster presented at: Electronic Conference On Teaching Statistics (eCOTS). https://www.causeweb.org/cause/ecots/ecots20/posters/2-03

Feldon, D. F. (2007). The implications of research on expertise for curriculum and pedagogy. Educational Psychology Review, 19(2):91–110.

Fry, E. (2017).
Introductory statistics students' conceptual understanding of study design and conclusions. PhD thesis, University of Minnesota.

GAISE College Report ASA Revision Committee (2016). Guidelines for Assessment and Instruction in Statistics Education College Report. https://www.amstat.org/education/guidelines-for-assessment-and-instruction-in-statistics-education-(gaise)-reports

Garvin-Doxas, K. and Klymkowsky, M. W. (2008). Understanding randomness and its impact on student learning: Lessons learned from building the Biology Concept Inventory (BCI). CBE—Life Sciences Education, 7(2):227–233.

Jorion, N., Gane, B. D., James, K., Schroeder, L., DiBello, L. V., and Pellegrino, J. W. (2015). An analytic framework for evaluating the validity of concept inventory claims. Journal of Engineering Education, 104(4):454–496.

Kaczmarczyk, L. C., Petrick, E. R., East, J. P., and Herman, G. L. (2010). Identifying student misconceptions of programming. In Proceedings of the 41st ACM Technical Symposium on Computer Science Education, SIGCSE '10, pages 107–111, New York, NY, USA. Association for Computing Machinery.

Kaplan, J. J., Gabrosek, J. G., Curtiss, P., and Malone, C. (2014). Investigating student understanding of histograms. Journal of Statistics Education, 22(2).

Karpierz, K. and Wolfman, S. A. (2014). Misconceptions and concept inventory questions for binary search trees and hash tables. In Proceedings of the 45th ACM Technical Symposium on Computer Science Education, SIGCSE '14, pages 109–114, New York, NY, USA. Association for Computing Machinery.

Konold, C. (1989). Informal conceptions of probability. Cognition and Instruction, 6(1):59–98.

Lane-Getaz, S. J. (2007). Development and Validation of a Research-based Assessment: Reasoning about P-values and Statistical Significance. PhD thesis, University of Minnesota.

Leighton, J. P. (2013).
Item difficulty and interviewer knowledge effects on the accuracy and consistency of examinee response processes in verbal reports. Applied Measurement in Education, 26(2):136–157.

Leighton, J. P. (2017). Using Think-Aloud Interviews and Cognitive Labs in Educational Research. Oxford University Press.

Leighton, J. P. (2021). Rethinking think-alouds: The often-problematic collection of response process data. Applied Measurement in Education, 34(1):61–74.

Lipson, K. (2002). The role of computer based technology in developing understanding of the concept of sampling distribution. In Proceedings of the Sixth International Conference on Teaching Statistics.

Liu, P. and Li, L. (2015). An overview of metacognitive awareness and L2 reading strategies. In Wegerif, R., Li, L., and Kaufman, J. C., editors, The Routledge International Handbook of Research on Teaching Thinking, chapter 22, pages 290–303. Routledge.

Lovett, M. (2001). A collaborative convergence on studying reasoning processes: A case study in statistics. In Carver, S. M. and Klahr, D., editors, Cognition and Instruction: Twenty-five Years of Progress, chapter 11, pages 347–384. Lawrence Erlbaum Associates Publishers.

Lübke, K., Gehrke, M., Horst, J., and Szepannek, G. (2020). Why we should teach causal inference: Examples in linear regression with simulated data. Journal of Statistics Education, 28(2):133–139.

McGinness, L. P. and Savage, C. M. (2016). Developing an action concept inventory. Physical Review Physics Education Research, 12:010133.

Meyer, M., Orellana, J., and Reinhart, A. (2020). Using cognitive task analysis to uncover misconceptions in statistical inference courses. Poster presented at: Electronic Conference On Teaching Statistics (eCOTS). https://www.causeweb.org/cause/ecots/ecots20/posters/2-02

Newman, D. L., Snyder, C. W., Fisk, J. N., and Wright, L. K. (2016).
Development of the Central Dogma Concept Inventory (CDCI) assessment tool. CBE—Life Sciences Education, 15(2).

Nielsen, J. and Landauer, T. K. (1993). A mathematical model of the finding of usability problems. In Proceedings of the INTERACT '93 and CHI '93 Conference on Human Factors in Computing Systems, CHI '93, pages 206–213, New York, NY, USA. Association for Computing Machinery.

Noll, J. and Hancock, S. (2015). Proper and paradigmatic metonymy as a lens for characterizing student conceptions of distributions and sampling. Educational Studies in Mathematics, 88(3):361–383.

Nørgaard, M. and Hornbæk, K. (2006). What do usability evaluators do in practice? An explorative study of think-aloud testing. In Proceedings of the 6th Conference on Designing Interactive Systems, pages 209–218.

Park, J. (2012). Developing and validating an instrument to measure college students' inferential reasoning in statistics: An argument-based approach to validation. PhD thesis, University of Minnesota.

Pfannkuch, M., Budgett, S., and Arnold, P. (2015). Experiment-to-causation inference: Understanding causality in a probabilistic setting. In Zieffler, A. and Fry, E., editors, Reasoning about Uncertainty: Learning and Teaching Informal Inferential Reasoning, chapter 4, pages 95–127. Catalyst Press.

Porter, L., Zingaro, D., Liao, S. N., Taylor, C., Webb, K. C., Lee, C., and Clancy, M. (2019). BDSI: A validated concept inventory for basic data structures. In Proceedings of the 2019 ACM Conference on International Computing Education Research, ICER '19, pages 111–119, New York, NY, USA. Association for Computing Machinery.

Pressley, M. and Afflerbach, P. (1995). Verbal Protocols of Reading: The Nature of Constructively Responsive Reading. Routledge.

Roberts, V. L. and Fels, D. I. (2006).
Methods for inclusion: Employing think aloud protocols in software usability studies with individuals who are deaf. International Journal of Human-Computer Studies, 64(6):489–501.

Sabbag, A. (2016). Examining the Relationship Between Statistical Literacy and Statistical Reasoning. PhD thesis, University of Minnesota.

Sawilowsky, S. S. (2004). Teaching random assignment: Do you believe it works? Journal of Modern Applied Statistical Methods, 3(1):221–226.

Taylor, C., Clancy, M., Webb, K. C., Zingaro, D., Lee, C., and Porter, L. (2020). The practical details of building a CS concept inventory. In Proceedings of the 51st ACM Technical Symposium on Computer Science Education, SIGCSE '20, pages 372–378, New York, NY, USA. Association for Computing Machinery.

Theobold, A. S. (2021). Oral exams: A more meaningful assessment of students' understanding. Journal of Statistics and Data Science Education, 29(2):156–159.

Williams, A. M. (1999). Novice students' conceptual knowledge of statistical hypothesis testing. In Truran, J. M. and Truran, K. M., editors, Making the Difference: Proceedings of the Twenty-second Annual Conference of the Mathematics Education Research Group of Australasia, pages 554–560. Adelaide, South Australia: MERGA.

Willis, G. B. (2005). Cognitive Interviewing. SAGE Publications.

Woodard, V. and Lee, H. (2021). How students use statistical computing in problem solving. Journal of Statistics and Data Science Education, 29(sup1):S145–S156.

Wren, D. and Barbera, J. (2013). Gathering evidence for validity during the design, development, and qualitative evaluation of thermochemistry concept inventory items. Journal of Chemical Education, 90(2):1590–1601.

Ziegler, L. A. (2014). Reconceptualizing statistical literacy: Developing an assessment for the modern introductory statistics course. PhD thesis, University of Minnesota.