Statistics, ethics, and probiotica

A randomized clinical trial comparing an experimental new treatment to a standard therapy for a life-threatening medical condition should be stopped early on ethical grounds, in either of the following scenarios: (1) it has become overwhelmingly clea…

Authors: Richard D. Gill

Richard D. Gill, Mathematical Institute, Leiden University
http://www.math.leidenuniv.nl/~gill; gill@math.leidenuniv.nl
4 September 2008
Statistica Neerlandica (2009) Vol. 63, nr. 1, pp. 1–12

Abstract

Ethical issues involved in the design of the “PROPATRIA” probiotica trial are discussed. This randomized clinical trial appeared to be well conducted according to accepted good practices. The finding that the treatment was actually rather harmful, and that despite this, and despite a built-in interim analysis, the trial was not stopped earlier, led to strong criticism in the media. I argue that “accepted good practices” need to be reconsidered in the light of this experience. Firstly, a much stronger distinction needs to be recognized between the immediate interests of the patients being treated in the trial, and the interests of future patients of future doctors elsewhere. Secondly, it is in the interests of future patients that well conducted clinical trials are accepted by society. Since it is unavoidable that an occasional trial will result in an unpredicted severely negative outcome, ethical screening committees must ensure that those performing a trial can never be accused of putting the interest of “science” above the interest of their own patients when such “accidents” happen. There are two consequences of this. Firstly, the design of a trial should also explicitly minimize the number of patients who are treated by the researchers with a potentially seriously harmful medicine. Secondly, the disadvantages of triple-blinding far outweigh the advantages. Though it might at best only have saved a few lives if the PROPATRIA trial had been re-designed with these issues in mind, I argue that the scientific value of the trial would not have been significantly reduced; the damage to medical research, and hence to future patients, would have been substantially less.

Closer inspection of the data from the PROPATRIA trial brings a new and quite unexpected failing to light. The decision on early stopping of the trial was accidentally based on the one-sided test looking in the wrong direction, partly through inadequacy of the output of the statistical package SPSS, partly through lack of statistical expertise on the part of the users. If the envisaged one-sided stopping rule had been used correctly, the trial would in fact have been terminated at the time of the interim analysis “for futility”: it was at this moment highly unlikely that a significant end-result in favour of probiotica was going to be attained. The decision to continue the trial was a result of looking at the test statistic “in the wrong direction”. In effect, the trial was continued because there was still a good chance to show that probiotica is actually very harmful. I recommend that data monitoring committees should always be advised by a professional statistician and that this person is not blinded to the treatment allocation.

1 The Design

The recently concluded PROPATRIA study of the use of probiotics in the treatment of acute pancreatitis (Besselink et al., 2004, 2008) received much media attention in the Netherlands when it was revealed at the close of the study that the treatment appeared to have a negative effect. It did not, as expected, decrease the chance of infectious complications, which remained at around 1 in 3. More seriously, among those that did develop infectious complications, the frequency of a fatal outcome was roughly doubled (roughly, from 1 in 4 to 1 in 2).
For ethical reasons the multi-centre, double-blinded, randomized trial had been monitored over the three years of its duration, and an interim analysis performed. The interim analysis was further semi-blinded in the sense that the monitoring committee did not know which group was the treatment group, which group was the control. Though, at the final conclusion of the trial, there had been a total of 33 deaths in the total group of 296 patients considered in the final analysis, only two of these deaths had been considered by doctors in the participating hospitals as so unexpected and serious at the time, that they were reported individually to the monitoring committee. The treatment group was revealed to the monitoring committee for those two cases only. In both cases the patient was in the treatment group; and in both cases the death was due to the same rather unusual complication which (only retrospectively) turned out to be the cause of a large number of the probiotica deaths.

The research team conducting the trial certainly strongly believed at the outset that probiotica likely would have a good effect on their patients. They had theoretical arguments supporting that opinion, and also empirical evidence pointing in the good direction (but there have also been negative reports). Moreover, the treatment consists of a cocktail of bacteria which, among healthy people, are normal and necessary residents of our digestive tract, and which are present in popular and commercially available food additives. New strains had been developed which were expected to stimulate immune response, and to compete with the “bad” bacteria associated with the infectious complications which are the usual cause of mortality in pancreatitis.
The researchers considered that there were good reasons to carry out a large scale trial to settle the issue, and did not expect any negative consequences of the treatment.

In retrospect, the triple blinding of the study is obviously an embarrassment to the already seriously disappointed and concerned researchers. This is especially the case since the trial was originally planned to include only 200 patients, not 300.¹ At the interim analysis of the first 100, the overall rate of infectious complication was lower than expected. The monitoring committee made the recommendation to increase the total intake to 300 patients “in order to preserve statistical power”. This is an intervention on behalf of science, not on behalf of the patients in the trial, as I will further argue below.

¹ Actually: 298 patients entered the trial, but in retrospect, it appeared 2 had been wrongly diagnosed and are removed from the analysis. The final report says that the monitoring committee advised increasing to 296. I wonder if the monitors really wrote this number, or if it was a slip of the pen of the final reporter, two years later. Whether this is exactly the number which came out of a calculation or not, I replace it here by a round number.

The formal interim analysis was postponed to the new, expected, half-way mark, but actually only took place when the treatment of a total of 184 patients had been completed. The mortality of the two groups seemed different but the difference did not quite reach significance at the 10% level. The monitoring committee associated this with a slight difference in the initial health state of the two groups; and the monitoring committee did not know which group was which. Presumably, if anything they could imagine that the trial was working out, to the advantage of the treatment, as the researchers had initially expected. An early stopping rule based on the “primary endpoint” of the trial, infectious complications, seemed to indicate that the trial should continue.

The statistical design was based on mostly standard recipes distilled from a standard work (at least, locally, in the Netherlands), Schouten (1999). That author highly recommends the interim analysis advised by Snapinn (1992). A small poll of professors of medical statistics in the Netherlands revealed that despite Schouten’s endorsement, this method is more or less forgotten. Internationally the methods of choice seem to be those of Pocock, and of Fleming and O’Brien. However The Lancet follows Sir Richard Peto’s much more radical advice only to act on interim analysis significance tests if they reach the 1 in 1000 significance level: in other words, almost never. There is a good philosophy behind this recommendation, but it is based on assumptions, and the question is always whether those assumptions ought to have been made in this case, even if they might be usually uncontroversial.

Triple blinding is very controversial, and is much discussed in the literature. The FDA extensively discusses the pros and cons in its guidelines for clinical trials in drug testing, and advises against it. A Dutch epidemiologist, Vandenbroucke (1999), endorses it strongly in a three page article in the Nederlands Tijdschrift voor de Geneeskunde. He gives good reasons for it, but there are also reasons against it, and the question is, what is most relevant for this specific case.

Normal practice in the US and the UK is that even if the data monitoring committee is blinded, the statistician who is advising them is not blinded. He or she is able to intervene if the committee seems to be making a decision which they would regret if the identity of the two groups were exchanged.
Normal practice is that the data monitoring committee explicitly plays through both scenarios in their deliberations: group A is treatment, B is control; and vice versa. If their decision under both scenarios is the same the data does not need to be de-blinded.

In the paper Besselink et al. (2004) describing the design of the trial in BMC Surgery, the researchers write

    This study is conducted in accordance with the principles of the Declaration of Helsinki and ‘good clinical practice’. For ethical reasons it is desirable to end a therapeutic experiment once a statistical significant difference in treatment results has been reached. This study uses the stopping-rules according to Snapinn (1992). An interim-analysis will be performed after the data of the first 100 patients (50% fraction) is obtained. According to Snappin, the trial will be ended at this interim-analysis at p < 0.0081. The study will also be ended in case of adverse events without possibility of positive outcome, p > 0.382. The monitoring committee will discuss the results of the interim-analysis and advice the steering committee. The steering committee decides on the continuation of the trial.

Now, there are certainly ethical reasons to conclude a trial as soon as possible. Schouten (1999) gives a list of four ethical concerns (and more can be added):

1. Is it ethical to give a placebo treatment to a seriously ill person?
2. Is it ethical to give an untested new medicine, which might have serious side effects, to a seriously ill person?
3. Is it ethical to decide treatment for a seriously ill patient, with the interests of science in mind, on the basis of tossing a coin?
4. Is it ethical still not to know, 10 or 20 years from now, what is the best treatment for a seriously ill patient?

The point I want to make is that different ethical concerns are in conflict with one another.
Any proposed “solution” implicitly weights the different concerns in a particular way.

Let me first sketch the global design of the trial. It was set up to have power 0.80 when testing the null hypothesis of no treatment effect, two-sidedly, and with significance level 0.05, against the alternative that the treatment roughly halved the probability of the “primary endpoint”, infectious complications. This means that the primary constraint on the researchers is to guard against the publication of “false positives”. They mustn’t run a higher risk than 1 in 20 of this. And journals which publish the results of medical research insist on the same “filtering” of signal from noise. Since the power against the actually expected effect is 0.80, they are prepared to run a risk of 1 in 5 that the treatment cannot be proved effective, even if it is exactly as effective as they believe. It is perhaps hard to believe that one invests so much research effort with a 1 in 5 chance that it is all wasted, but this seems to be standard practice. On the other hand, one could say that taking such a low sample size is actually protecting patients, in the case that the new treatment turns out to be bad for them.

Whose ethical concerns are addressed by these choices? Primarily, the concerns of science and of future patients of future doctors. We don’t want to tell the world that probiotica is fantastic, when actually it does nothing. (And The Lancet won’t let us do this either.) There is an unpleasant side effect here: if probiotica actually doubles the risk of infectious outcomes, instead of halving it, we also run a risk of 1 in 5 of not noticing that, and just getting a non-significant result. At least, a false-positive result does not get published in that case either. Would The Lancet have published a non-significant result anyway?
It is surely bad to tell the world that probiotica makes no difference, when it actually doubles your risk.

Now let us overlay these, already ethical choices, with the choice implied by Snapinn’s method. The reason Schouten is so enthusiastic about Snapinn is that it does not alter the evaluation of the final results when the trial is not stopped half-way. Thus the nominal size α = 0.05 is maintained, even though there is some chance that the sample size is half what it apparently should have been. There are no free lunches. This means that either the power is reduced, or the chance to stop early hardly exists. Well, we have to be a bit more careful, since the chance of early stopping depends on “the truth”. Snapinn is conservative, and does not want to lose much power. There is some chance of early stopping if the null hypothesis is true, but almost no chance of early stopping if the alternative hypothesis is true. Thus the power is only slightly reduced. Schouten has adapted Snapinn’s design, and accepts a much larger loss of power in order to increase the chances of early stopping, under both hypotheses.

Whose ethical concerns are addressed by use of this early stopping rule? Clearly it is good to stop early if probiotica doesn’t do anything at all (though whether The Lancet would still publish if the trial is aborted halfway because nothing is expected to come of it, is an interesting question). What about the issues of interest to the patients actually entering the trial? If a new treatment has unpleasant side effects but otherwise makes no difference, then the patients who would normally enter the trial later would appreciate not receiving it. It’s a bit strange to call the “secondary outcome” death merely an unpleasant side effect. And how big is the chance of stopping early, if the treatment has no effect at all?
The answer is easy to read off from the actual implementation of the Snapinn rule: the researchers planned to stop the trial half-way if the p-value at that time was larger than 0.382. This means that there was a 62% chance that the trial would be stopped half-way if actually the treatment did not reduce infectious complications (as indeed seems to be the case). If the treatment has a negative effect, the probability of stopping early “for futility” rapidly gets larger. So at this point the trial does have a built-in safety measure.

Whose ethical concerns are addressed by adopting Vandenbroucke’s advice to triple blind? This makes it even less likely for a monitoring committee to “break the rules” by stopping the trial early when it is going in an unexpected, negative direction. The point is that future patients of future doctors have an interest in trials being completed to full term, otherwise results are biased. A trial which is stopped early because of an apparent negative effect is probably stopped when the observed (negative) effect is worse than the actual one.

In conclusion: ethical safeguards were built into the statistical design of the trial, but they are a variety of safety measures for different, conflicting, ethical issues. Some safety measures actually increase the ethical dangers in the situation which, by the admission of the researchers themselves, most likely obtained for the probiotica trial, namely testing a treatment which turned out to be harmful for their patients. For this eventuality, just one statistical safety measure was built in: the possibility of “stopping for futility” in the Snapinn plan.
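The 62% figure above follows because a p-value from a continuous test statistic is uniformly distributed under the null hypothesis, so P(p > 0.382) = 1 − 0.382. The sketch below is only an illustration under stated assumptions: the interim statistic is treated as a normal z-score with unit variance, positive values favouring the new treatment, and the `drift` parameter (the mean of that z-score) is a hypothetical device for mimicking a beneficial or harmful treatment. The function names are invented here; only the two thresholds come from the protocol quoted earlier in this section. The trial itself used a discrete Fisher exact test, for which the numbers would be approximate.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

# Interim critical values as quoted from the trial protocol (Besselink et al., 2004).
P_SIGNIFICANCE = 0.0081   # stop for significance if p < 0.0081
P_FUTILITY = 0.382        # stop for futility if p > 0.382


def interim_decision(p_one_sided):
    """Apply the two quoted thresholds to a one-sided interim p-value."""
    if p_one_sided < P_SIGNIFICANCE:
        return "stop for significance"
    if p_one_sided > P_FUTILITY:
        return "stop for futility"
    return "continue"


def futility_stop_probability(drift):
    """P(p > P_FUTILITY) when the interim z-statistic is N(drift, 1).

    Assumption: p = 1 - Phi(Z), with positive Z favouring the new treatment,
    so drift = 0 means "no effect" and drift < 0 means "treatment is harmful".
    """
    # p > 0.382 is the same event as Z < z_cut.
    z_cut = STD_NORMAL.inv_cdf(1.0 - P_FUTILITY)
    return STD_NORMAL.cdf(z_cut - drift)


if __name__ == "__main__":
    for drift in (1.0, 0.0, -1.0, -2.0):
        prob = futility_stop_probability(drift)
        print(f"drift {drift:+.1f}: P(stop for futility) = {prob:.3f}")
```

With `drift = 0` the function returns 0.618 up to rounding, and the probability increases as the drift becomes more negative, which is the built-in safety measure described in the text.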
2 The Results

In their 2008 publication in The Lancet at the close of the trial, the researchers write

    We calculated that 200 patients with predicted severe acute pancreatitis would be required to detect a 20% reduction in the absolute risk of the occurrence of infectious complications (from 50% to 30% of patients during admission and 90-day follow-up) for the study to attain an 80% statistical power, at a two-sided α of 0.05. This sample size calculation took into account the fact that up to 40% of patients with predicted severe pancreatitis are ultimately diagnosed with mild pancreatitis (i.e., no local or systemic complications) and thus do not progress to severe or necrotising pancreatitis.

    After the first 100 patients were randomised and had completed follow-up, the number of infectious complications was calculated in the total group. The rate of infectious complications was lower than expected (28%), so the monitoring committee advised increasing the total sample size from 200 to 296 patients to maintain statistical power. After 184 patients had been randomised and had completed follow-up, a blinded interim analysis was done for the primary endpoint and mortality. Although a non-significant difference in mortality was observed (p = 0.10), the monitoring committee concluded that this had been caused by skewed randomisation because more patients in the group with higher mortality required admission to intensive care within 72 hours after admission (p = 0.15), whereas the overall mortality was well within the expected range (11%). According to the predefined stopping rule the monitoring committee recommended that the study should be completed.

    During the study, two serious adverse events were reported; both patients died. The monitoring committee convened on both occasions: in one patient, a ruptured caecum with ischaemia was found during emergency laparotomy and the second patient had small-bowel ischaemia diagnosed at emergency laparotomy. In both cases, the randomisation code was broken (both patients had received probiotics). This information was revealed only to members of the monitoring and steering committees. A review of published work did not reveal any evidence of a relation between bowel ischaemia and the use of probiotics. The monitoring committee subsequently advised that the study continue. The institutional review board was informed on both occasions.

I want to draw attention to two things: firstly, the advice of the monitoring committee to increase the sample size in order to maintain statistical power. Actually, this possibility was envisaged in the original protocol of the trial, for the following reason. Patients had to be entered into the trial on the basis of predicted severe acute pancreatitis. It takes another week or more before a more certain diagnosis can be made, but it was important to start the treatment straight away. It could be that many patients were being admitted who in retrospect only had mild acute pancreatitis. In that case, a bigger sample size would be needed to see the same effect on those patients with severe acute pancreatitis.

This means that the data monitoring committee was authorized to intervene in the design of the trial, for ethical reasons concerned with the treatment of future patients of future doctors, not with the treatment of their own patients in their own trial.
They were authorized to make decisions about the future treatment of patients about to enter the trial, based on the results of the trial so far, without knowing which group was the treatment group and which group was the control group. It seems that the monitoring committee dealt with this problem by looking at the aggregate data of the two treatment groups. When they did this, they saw a much lower rate of infectious complications overall than had been expected in advance, and hence concluded that there were more patients with only mild pancreatitis in the trial than planned. Simultaneously the monitoring committee saw the same overall mortality rate as had been expected in advance! It seems to me that an alarm bell might have gone off here. The monitoring committee does not know that the excess deaths (not statistically significant, to be sure) are occurring in the treatment group! Despite the conflicting information, and blinded to the identity of the two groups, they proposed increasing the sample size.

Secondly, the application of Snapinn’s rule talks about adverse events. Yet only two adverse events were ever investigated by the monitoring committee. Only two adverse events were “serious”. Both serious adverse events were connected to the same complication. A literature search did not connect this kind of event to probiotica. What if the monitoring committee had known that already half of the many deaths in the probiotica group were of this same rare kind? Perhaps the literature search would have been extended with consultation with experts from other fields. It is now easy, after the events, to find microbiologists who say in effect “I told you so”. I think it could have been good if the monitoring committee had talked to these microbiologists already half-way through the trial.
If the trial had not been triple-blinded, the monitoring committee might now have pulled the plug on it. Obviously (if the effect which has been found is real), this would have saved some lives of patients in the trial. Would science, would future patients of future doctors, have suffered?

In retrospect there are plausible medical explanations for the “new” phenomenon. When your immune system is at breaking point, “stimulating” it with friendly bacteria is not a good idea. When the barriers between different organs are breaking down and evil, aggressive bacteria are moving freely from one place to another, adding new streams of migrants, however useful they might be in your gut, only makes things worse in places where they ought not to be. It seems to this non-medical person that if half-way the researchers had seen what was the special complication which was killing the already seriously ill patients in the treatment group, they could just as well have come up with this theory half-way already. It seems to this non-medical person that if the trial had been stopped half-way, the substantive conclusions of the trial which now appear in The Lancet could also have been reached and could also have been told to the world, one way or another.

Anyway, if the monitoring committee found other microbiologists with different ideas, they could have only temporarily stopped the trial. If it was medically truly believed to be a false alarm, the trial could have been resumed after a break. My point is that at least they could, even if only in retrospect, have proved that they did always have the interests of their own patients in mind, not just the interests of the future patients of future doctors.
This would have made life for them easier now; and it would be to the benefit of future patients and of science, since clinical trials can only be done if the public is confident that their doctors always think in the first place of their own patients.

3 The stopping rule

As noted above, the PROPATRIA team report that the Snapinn stopping rule indicated that the trial should continue at the new (postponed) interim analysis, which actually took place not half-way through the modified trial at 150, but a bit later at 184 patients. It turns out that this conclusion was based on a mis-reading of the output of a statistical package, and that according to the protocol of the experiment the trial should have been stopped at that moment. In order to explain how this happened I need to discuss the stopping rule in a little more detail, and first of all to repeat the principles on which it is constructed.

The idea of the Snapinn rule is that a randomized clinical trial comparing an experimental new treatment to a standard therapy for a life-threatening medical condition should be stopped early on ethical grounds, in either of the following situations: (1) it has become overwhelmingly clear that the new treatment is better than the standard; (2) it has become overwhelmingly clear that the trial is not going to show that the new treatment is any better than the standard. These two situations are called “stopping for significance” and “stopping for futility”, respectively. The trial is continued in the third scenario: (3) there is a reasonable chance that the new treatment will finally turn out to be better than the standard, but we aren’t sure yet.

An explicit possibility of stopping for futility provides an implicit safety measure: if the new treatment is actually harmful to the patients, we would want the trial to stop as early as possible.
But this situation would also tend to result in data such that the trial is stopped early “for futility”.

The PROPATRIA team used an adaptation due to Schouten (1999, standard Dutch textbook Klinische Statistiek) of the early-stopping rule of Snapinn (1992, Statistics in Medicine). The key feature of this early-stopping rule is that it is based on the p-value of the statistic of interest, at the time of the interim analysis. “Time” is expressed as the fraction of the originally planned sample size at which one is performing the interim analysis. One should simply compare the interim p-value to two critical values, one for stopping for significance, the other for stopping for futility. The critical values are determined from the overall intended significance level and power, and the interim sample fraction (time). Snapinn’s procedure is carefully designed such that the overall significance level of the trial with early stopping allowed is the same as the significance level of the trial with fixed sample size. Moreover, if the trial is not stopped early, the statistical analysis at the end of the trial is the same as if early stopping had not been incorporated in the design at all. Because the trial might stop early and the significance level is unaltered, some power is lost. Snapinn, and following him Schouten, have tuned the thresholds for early stopping in a compromise between loss of power and chance of stopping early. Schouten allows a bigger loss of power, hence increases the chance of stopping early.

As just mentioned, the data monitoring committee was blinded to the identity of treatment groups A and B.
This means that they needed to compute the one-sided p-value for testing the null hypothesis of no treatment effect against the alternative of a beneficial treatment effect, with both assignments of “treatment” and “control” to groups A and B. The outcome is binary (infectious complications or not) and the researchers used the Fisher exact test for a 2 × 2 contingency table. In that context, the statistical package SPSS does not allow the user to specify which one-sided alternative is of interest, but reports the p-value for the one-sided test which is more significant; i.e., that of the alternative suggested post-hoc by the data. Apparently, the committee did not realise what was going on. The p-value delivered by SPSS did not depend on the labelling of the two groups.

But what was it? Though the PROPATRIA researchers declined to provide the data from the interim analyses to interested scientists, they did accidentally provide some data to interested journalists at a press-conference. The two groups are labelled there as group A and group B. From the snippets of information about the interim analysis available in the Lancet paper, we can determine that group A is the treatment group, and group B is the placebo or control group. To my great surprise it turns out that at the interim analysis, the rate of infectious complications (the primary endpoint) in group A exceeded that in group B by an absolute amount of 5%. Normalizing with an estimate of the standard deviation of the difference yields a z-value of close to +1. Thus the one-sided p-value for the alternative that probiotica is bad for you is about 16%; the one-sided p-value for the alternative that probiotica is good for you is about 84%. The Fisher exact test gives similar p-values of 21% and 87% respectively.
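The two one-sided p-values, and the way a "report the smaller one" convention hides the direction of the effect, can be illustrated with a short sketch. The interim counts were not published, so the counts used below (30 of 92 versus 25 of 92) are purely hypothetical, chosen only to give an absolute difference of about 5% among 184 patients; the resulting Fisher p-values should not be expected to match the 21% and 87% quoted above. The function names are likewise invented for this illustration.

```python
from math import comb
from statistics import NormalDist


def one_sided_p_from_z(z):
    """Return (p_harmful, p_beneficial) for z = (rate_A - rate_B) / se.

    A positive z means more infectious complications in group A (treatment).
    """
    phi = NormalDist().cdf(z)
    return 1.0 - phi, phi


def fisher_one_sided(events_a, n_a, events_b, n_b):
    """One-sided Fisher exact p-values for a 2 x 2 table.

    Returns (p_a_higher, p_a_lower): the hypergeometric probabilities of
    at least / at most `events_a` events in group A, given the table margins.
    """
    total = n_a + n_b
    events = events_a + events_b
    lo = max(0, events - n_b)      # smallest possible count in group A
    hi = min(events, n_a)          # largest possible count in group A
    denom = comb(total, n_a)

    def pmf(k):
        return comb(events, k) * comb(total - events, n_a - k) / denom

    p_a_higher = sum(pmf(k) for k in range(events_a, hi + 1))
    p_a_lower = sum(pmf(k) for k in range(lo, events_a + 1))
    return p_a_higher, p_a_lower


# z close to +1, as reported in the text.
p_harmful, p_beneficial = one_sided_p_from_z(1.0)

# HYPOTHETICAL counts (not the trial data): group A 30/92, group B 25/92.
a_higher, a_lower = fisher_one_sided(30, 92, 25, 92)
b_higher, b_lower = fisher_one_sided(25, 92, 30, 92)   # labels swapped

# Reporting only the smaller one-sided p-value gives the same number
# whichever group is called "treatment".
reported_ab = min(a_higher, a_lower)
reported_ba = min(b_higher, b_lower)

if __name__ == "__main__":
    print(f"z = 1: p(harmful) = {p_harmful:.3f}, p(beneficial) = {p_beneficial:.3f}")
    print(f"Fisher, A as treatment: harmful {a_higher:.3f}, beneficial {a_lower:.3f}")
    print(f"smaller p-value, either labelling: {reported_ab:.3f} / {reported_ba:.3f}")
```

The p-value relevant to the stopping rule is the one for the alternative "treatment is beneficial" (`a_lower` when A is the treatment group), which can only be identified once the labels are known.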
The data monitoring committee obtained from SPSS, for both cases, the smaller p-value of 21%. Comparison with the critical values from Snapinn-Schouten leads to the advice “continue” in both cases. There is no need to de-blind the data. However the appropriate p-value was 87% and the proper conclusion was to stop the experiment for futility. In order to obtain the correct decision it would have been necessary to de-blind. The data monitoring committee would then not only have received a signal from the Snapinn rule that it was pointless to continue the trial (the rate of infectious complications in the treatment group was actually larger than in the control group and there was almost no chance that this could reverse by the end of the trial); they would also have seen that the mortality was also much larger in the treatment group than in the control group.

The fact that the researchers did not realise that anything was wrong indicates some lack of understanding of the principles behind the statistical methods they were using. There are some other indications of inadequate understanding, though to be fair, both Snapinn and Schouten-on-Snapinn are difficult reading. At the same time, it is very difficult to trace exactly what they did do: the publications of the probiotica group always refer to Snapinn (1992) without further specification. Snapinn gives three versions of his stopping rule, Schouten gives two more. The critical values quoted by the researchers cannot be found in Snapinn’s paper at all.

Now, Schouten made several innovations. He allows a greater chance of early stopping, at the cost of decrease of power. While Snapinn throughout works with a one-sided significance level of 2.5%, Schouten uses throughout 5%. There is a good reason for Snapinn’s choice.
He wants to graft his carefully constructed asymmetric early stopping rule onto a fixed sample size final evaluation with the conventional two-sided significance level of 5%. Maybe in an instinctive correction for Schouten’s inappropriate doubling of the significance level, the probiotica researchers took critical values corresponding to halving the error of the second kind: so together, they got their critical values from the table for one-sided 5% significance level and power 90%, instead of the “trial design parameters” one-sided 2.5% significance level and power 80%. There is one more mismatch: the interim analysis had been planned at a sample fraction of 50%, but in fact only took place at 60%; they should have gone back to the tables to find the critical values at 60% rather than using those for 50%. However, none of these inaccuracies in the implementation affects the conclusion which should have followed: stop for futility.

4 Optimal group sequential designs

Jennison and Turnbull (2002) provide methodology for determining group sequential plans which minimize the expected sample size for given errors of the two kinds. In particular one can design a plan with the same errors of the first and second kinds as the PROPATRIA trial, and which minimizes the expected sample size when the actual effect of probiotica is the opposite to that expected: a doubling instead of a halving of the rate of infectious complications. It turns out that under that negative scenario, the expected sample size is 15% of the fixed sample size, with the same size and power. However, this result is not directly relevant to the PROPATRIA trial since the rate of infectious complications, the primary endpoint of the trial, was hardly affected by the treatment.
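The “fixed sample size” against which such group sequential plans are measured can be approximated with the textbook normal-approximation formula for comparing two proportions. The sketch below is an illustration under stated assumptions, not the calculation used by the PROPATRIA researchers or by Jennison and Turnbull: the function name is made up, the 30% baseline is hypothetical, and the 10% baseline is used only because a death rate of that order is discussed in this paper. It shows how the required number of patients per group depends on the baseline rate when the treatment is expected to halve that rate.

```python
from math import ceil
from statistics import NormalDist


def fixed_sample_size_per_group(p_control, p_treatment,
                                alpha_two_sided=0.05, power=0.80):
    """Approximate patients per group for a fixed-sample two-arm trial.

    Standard normal-approximation formula for two proportions
    (an assumption of this sketch, not taken from the trial protocol).
    """
    z_alpha = NormalDist().inv_cdf(1.0 - alpha_two_sided / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = (p_control * (1.0 - p_control)
                    + p_treatment * (1.0 - p_treatment))
    effect = p_control - p_treatment
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / effect ** 2)


# A 10% event rate expected to be halved to 5%.
n_low_rate = fixed_sample_size_per_group(0.10, 0.05)

# HYPOTHETICAL higher baseline: 30% expected to be halved to 15%.
n_high_rate = fixed_sample_size_per_group(0.30, 0.15)

print(n_low_rate, n_high_rate)
```

Under this approximation, halving a rare event rate needs several times as many patients per group as halving a common one, for the same size and power.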
Since infectious complications are the major cause of death in acute pancreatitis, and since deaths from some other causes also turned out to be increased by the treatment, in retrospect it would have been wise to take the primary endpoint of the trial as death from the disease. Now, the expected death rate was 10%, and the researchers presumably expected this too to be halved by their treatment, while in fact it was doubled. Because of the lower rate, the fixed sample size required for two-sided size 5% and power 80% becomes larger. Still, the final result is that a group sequential plan designed for early stopping in the case of a negative effect of the treatment would have led to this trial being stopped at about 100 patients.

5 Conclusions

Medical researchers following their hunches and anxious to prove they have found a fantastic new way to cure patients pull sophisticated statistical tools out of the drawer. Using these sophisticated tools helps persuade ethical screening committees to back them. The standard methods in the standard textbooks already make ethical assumptions. Routine application of these methods means one is routinely making those same ethical assumptions. But I suspect that no-one realises what those ethical assumptions are. And no-one realises that addressing one ethical concern might expose you more seriously to another. In principle mathematical statisticians can figure out the solutions to the more complicated optimization problems which arise when you try to take account of more, and conflicting, concerns at the same time. At least, we can bring these out into the open so everyone knows what is the cost of “buying” one of the standard solutions.

We could have designed the trial so that it minimized the expected number of patients entering the trial if the probiotica doubles the risk, subject to the same size and power.
It would have been a completely different design. Possibly it would not have saved many lives, possibly it was infeasible. But I think it is important to know whether the researchers could easily have saved many lives, or if they were already doing close to what is best, even though they had not primarily concerned themselves with protecting their patients from a possibly dangerous medicine.

Admittedly, a serious complication to any mathematical statistical treatment is that in this case the treatment had increased the chance of a certain serious side effect. To be precise, death. So my main recommendation is not to do some difficult mathematics, but to reappraise the ethics of “triple blind”, especially on top of all the other, actually rather one-sided, ethical concerns embodied in the usual choices of null and alternative, size and power.

The choice of the Snapinn protocol is one of the few places where the trial has a built-in safety feature: if the treatment is working badly, there is some chance that “stopping for futility” will be triggered. It is especially tragic that this safety feature failed to work through a misreading of the output of a statistical package.

My conclusion is that ethical screening committees need more statistical expertise in order to judge which ethical concerns are being taken account of when someone else’s routine and technical “ethical solution” is implemented. Secondly, monitoring committees also need more statistical expertise, in order to correctly implement complex statistical protocols. Thirdly, if the monitoring committee is blinded to the identity of the treatment and the control group, they should at least be advised by a person who is not blinded, in order to ensure that the committee never makes decisions based on an incorrect guess of which group is which. Finally: it is important to learn from mistakes.
References

M.G.H. Besselink, H.M. Timmerman, E. Buskens, V.B. Nieuwenhuijs, L.M.A. Akkermans, H.G. Gooszen and the members of the Dutch Acute Pancreatitis Study Group (2004), Probiotic prophylaxis in patients with predicted severe acute pancreatitis (PROPATRIA): design and rationale of a double-blind, placebo-controlled randomised multicenter trial [ISRCTN38327949], BMC Surgery, 4:12, doi:10.1186/1471-2482-4-12 (7pp.)

M.G.H. Besselink et al. (2008), Probiotic prophylaxis in predicted severe acute pancreatitis: a randomised, double-blind, placebo-controlled trial, The Lancet, published online February 14, 2008, DOI:10.1016/S0140-6736(08)60207-X (9pp.)

H.J.A. Schouten (1999), Klinische Statistiek (“Clinical Statistics”), Houten: Bohn, Stafleu van Loghum.

S.M. Snapinn (1992), Monitoring clinical trials with a conditional probability stopping rule, Statistics in Medicine 11, 659–672.

J.P. Vandenbroucke (1999), Dwalingen in de methodologie XIV. Het voortijdig beëindigen van een gerandomiseerde trial (“Methodological errors XIV: Stopping a randomized trial too early”), Nederlands Tijdschrift voor de Geneeskunde 143, 1305–1308.
