Two-stage re-estimation adaptive design: a simulation study

Introduction
It is increasingly recognized that growing expenditure on biomedical research has not led to an increased number of successful therapies entering clinical practice.
Reasons identified are: (1) a diminished margin for therapeutic improvement, which escalates the level of difficulty in proving drug benefit; (2) genomics and other innovative scientific tools having not yet reached their full potential; (3) mergers and other business arrangements that have decreased developmental candidates; (4) easy targets are the focus as chronic diseases are more difficult to study; (5) a persistently high failure rate in drug development; and (6) the rapidly escalating research costs and complexity, which decrease willingness/ability to bring many candidates forward into the clinic [1].
In 2004, in response to this situation, the Food and Drug Administration (FDA) launched a project, called the Critical Path Initiative, aimed at driving innovation in the scientific processes through which medical products are developed, evaluated, and manufactured [2]. Later, in 2006, the FDA released the Critical Path Opportunities List, a document collecting 76 projects aimed at filling the gap between the resources spent and the results obtained in clinical research. One of these projects focused on the development of innovative clinical trial methodologies, such as Bayesian approaches or adaptive designs, able to use prior experience or accumulated information in the effort to improve research efficiency [3].
Adaptive designs in particular are defined by the FDA as studies that incorporate a prospectively planned opportunity for modification of one or more protocol-specified aspects regarding patient selection, treatment and assessment, or even the statistical hypotheses to be tested; the modification is based on the analysis of accumulated data from the subjects enrolled up to a given point in time. Analyses of the accumulated data performed at prospectively planned time points within the study are also labelled as interim analyses, and can be performed in a fully blinded or an unblinded manner, as well as with or without formal statistical hypothesis testing [4].
There are many proposals on how to conduct an adaptive study. The main design types can be classified according to the phase of clinical research (learning or confirmatory phase) for which they are suitable. Designs indicated in the learning phase are (1) adaptive dose finding design and (2) adaptive seamless phase I/II design. Designs suitable in the confirmatory phase are (1) sample size re-estimation design; (2) biomarker adaptive (or enrichment) design; (3) adaptive randomization design; (4) adaptive treatment switching design; (5) adaptive group sequential design; (6) adaptive hypotheses design; (7) adaptive recursive design. Finally, a type of adaptive design useful for both the learning and confirmatory phase is the adaptive seamless phase II/III design.
Among such a variety, the approach with the widest applicability is the sample size re-estimation design, through which it is possible, in the course of the study, to increase or decrease the initially planned sample size based on interim results. The reasons for interest in this approach are obvious: one may avoid oversizing the study if intermediate results are favourable, or missing clinically meaningful treatment effects if intermediate results are less optimistic than anticipated. At the same time, however, there is controversy in the statistical literature about the actual benefits implied by the re-estimation design in comparison with studies incorporating conventional interim analyses [5].
In this paper, we report the results of simulations used for investigating the statistical properties of two-stage sample size re-estimation designs in terms of type I error control, study power and sample size. Such an assessment was made in comparison with classic fixed-sample and group-sequential studies, and considering various options available for sample size re-estimation designs.

Methods
Before describing simulations in detail, it is important to recall the process used to combine the results of each stage of an adaptive study according to the inverse normal method, as proposed by Lehmacher and Wassmer [6], and the available methods for sample size re-estimation.
In general, the test statistic resulting from the combination of independent p-values is given by
$$\frac{1}{\sqrt{k}} \sum_{i=1}^{k} \Phi^{-1}(1 - p_i) \qquad (1)$$
where k is one of the K planned stages, k = 1,2,…,K, $p_i$ is the p-value of the i-th stage and $\Phi^{-1}(\cdot)$ denotes the inverse cumulative standard normal distribution function.
An advantage of the inverse normal method, unlike the Fisher combination test [7], is that classical group sequential boundaries for early acceptance or rejection of the null hypothesis may be used in conjunction with test statistic (1), thus simplifying study design. Since the $\Phi^{-1}(1 - p_k)$, k = 1,2,…,K, are independent and standard normally distributed under the null hypothesis, the proposed approach maintains the type I error rate α exactly for any (adaptive) choice of sample sizes [6].
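The combination rule can be illustrated with a minimal Python sketch. It assumes one-sided stage-wise p-values and equal stage weights, as in test statistic (1); the function names are hypothetical and are not taken from the authors' SAS macros.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def inverse_normal_statistic(p_values):
    """Equal-weight inverse normal combination of one-sided p-values.

    Returns sum(Phi^-1(1 - p_i)) / sqrt(k) for the k stages supplied.
    Each p-value must lie strictly between 0 and 1.
    """
    k = len(p_values)
    z_scores = [STD_NORMAL.inv_cdf(1.0 - p) for p in p_values]
    return sum(z_scores) / sqrt(k)


def combined_p_value(p_values):
    """One-sided p-value corresponding to the combined statistic."""
    return 1.0 - STD_NORMAL.cdf(inverse_normal_statistic(p_values))
```

For instance, stage-wise p-values of 0.10 and 0.02 give a combined statistic of about 2.36, which would then be compared with the group sequential critical value chosen for the final analysis.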
Two methods may be used for sample size re-estimation. The first one is based on the effect size ratio [8], which requires that re-estimation of the sample size after the interim analysis is given by
$$N = \left(\frac{E_0}{E}\right)^{2} N_0$$
where N is the newly estimated sample size for the entire study, $N_0$ is the initial sample size (estimated using the method of a classic study), E is the observed effect size and $E_0$ is the initial estimate of effect size (used for the estimation of $N_0$).
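A short Python sketch of the effect size ratio rule follows. It assumes the squared ratio (sample size being inversely proportional to the squared effect size) and adds a cap on the overall sample size; the function name, the treatment of non-positive observed effects and the rounding rule are illustrative assumptions rather than details given in the text.

```python
from math import ceil


def reestimate_by_effect_size_ratio(n0, e0, e_observed, n_max):
    """Re-estimated overall sample size N = (E0 / E)^2 * N0, capped at n_max.

    n0         -- initially planned sample size
    e0         -- effect size assumed at the planning stage
    e_observed -- effect size observed at the interim analysis
    n_max      -- maximum overall sample size allowed (capping)
    """
    if e_observed <= 0:
        # Assumption: with no positive observed effect the ratio is not
        # usable, so the sketch simply returns the cap.
        return n_max
    n_new = ceil(n0 * (e0 / e_observed) ** 2)
    return min(n_new, n_max)
```

With 233 planned observations per arm, a planned effect of 0.3 and an observed effect of 0.25, the rule gives 336 per arm; an observed effect of 0.2 would call for more than the cap of 350, so 350 is returned. The sketch does not enforce that the new size is at least the number of observations already collected at the interim analysis.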
The second re-estimation method is based on the conditional power [1], i.e. the conditional probability of rejecting the null hypothesis during the rest of the trial based on the observed interim data. When using the inverse normal method for combining independent p-values, the re-estimation of sample size after the interim analysis in a two-stage adaptive trial is given by
$$N_2 = \frac{2\sigma^2}{\delta^2}\left[\sqrt{2}\,\Phi^{-1}(1 - \alpha_2) - \Phi^{-1}(1 - p_1) - \Phi^{-1}(1 - cP)\right]^{2}$$
where $N_2$ is the newly estimated sample size only for the second stage of the trial, $\sigma^2$ is the known variance, δ is the target difference between group means, $\Phi^{-1}(\cdot)$ denotes the inverse cumulative standard normal distribution function, $\alpha_2$ is the critical value for the final analysis according to the chosen decision boundaries, $p_1$ is the p-value achieved at the interim analysis and cP is the pre-planned conditional power. For obvious practical reasons, sample size extension is typically capped to a suitably chosen maximum overall sample size.
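The conditional power rule can be sketched in Python as below. The sketch assumes equal weights for the two stages in the inverse normal combination, a one-sided test, and a result expressed per trial arm; the function name and the handling of the case in which the target conditional power is already met are illustrative assumptions.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def second_stage_size(p1, alpha2, c_power, delta, sigma=1.0):
    """Second-stage sample size per arm for a target conditional power.

    p1      -- one-sided p-value observed at the interim analysis
    alpha2  -- nominal one-sided level of the final analysis
    c_power -- pre-planned conditional power (cP)
    delta   -- target difference between group means
    sigma   -- known standard deviation
    """
    z = STD_NORMAL.inv_cdf
    # Drift required in stage 2 so that (z1 + z2) / sqrt(2) crosses the
    # final critical value with probability c_power.
    required = sqrt(2.0) * z(1.0 - alpha2) - z(1.0 - p1) - z(1.0 - c_power)
    if required <= 0:
        # Target already met without further drift; in practice a
        # minimum second-stage size would be imposed.
        return 0
    return ceil(2.0 * sigma ** 2 / delta ** 2 * required ** 2)
```

As an example under these assumptions, an interim p-value of 0.10, a final critical z-value of 1.977, a conditional power of 0.90 and a standardized difference of 0.3 give 174 observations per arm for the second stage, before any capping is applied.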
As regards simulations, we assumed the situation of a balanced two-arm trial aimed at comparing two means of normally distributed data. Using a fixed-sample design, 233 observations per trial arm (466 overall) are required to test a standardized effect size δ = 0.3 with type I error rate and power set to 2.5% and 90%, respectively. Various scenarios were considered based on a number of factors.
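The fixed-sample figure can be checked with the usual normal-approximation formula for comparing two means with known variance. The sketch below assumes a one-sided test at the 2.5% level; the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def fixed_sample_size_per_arm(delta, alpha=0.025, power=0.90):
    """Unrounded per-arm sample size, n = 2 * (z_alpha + z_beta)^2 / delta^2.

    delta is the standardized effect size (difference in means divided by
    the common standard deviation); alpha is one-sided.
    """
    z_alpha = STD_NORMAL.inv_cdf(1.0 - alpha)
    z_beta = STD_NORMAL.inv_cdf(power)
    return 2.0 * (z_alpha + z_beta) ** 2 / delta ** 2


def fixed_sample_power(n_per_arm, delta, alpha=0.025):
    """Power of the one-sided fixed-sample z-test with n_per_arm per group."""
    drift = delta * sqrt(n_per_arm / 2.0)
    return STD_NORMAL.cdf(drift - STD_NORMAL.inv_cdf(1.0 - alpha))
```

Under these assumptions `fixed_sample_size_per_arm(0.3)` returns about 233.5, in line with the 233 observations per arm quoted above, and `fixed_sample_power(233, 0.3)` is close to 0.90.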
The first one was the information fraction, i.e. the ratio between the sample size at which the interim analysis is conducted and the sample size fixed a priori for the entire study (obtained with conventional methods for the fixed-sample design). Most two-stage studies use an information fraction for the interim analysis equal to 50%, i.e. half of the sample size planned; in addition to this choice, we also considered information fractions of 30% or 70%.
The second factor was the type of group sequential boundaries. We decided to consider O'Brien and Fleming's and Pocock's boundaries, which are the most frequently used and cover two quite heterogeneous settings. While O'Brien and Fleming's boundaries imply monotonically decreasing critical values, making it difficult to stop the trial early but with almost no loss of statistical power, the opposite properties characterize Pocock's boundaries, in which critical values are constant at each stage [9].
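The contrast between the two boundary types can be made concrete for a two-stage design with an interim analysis at 50% information. The critical values used in the usage note below (2.178 at both stages for Pocock; 2.797 and 1.977 for O'Brien and Fleming) are the commonly tabulated constants for two equally spaced looks at a one-sided 2.5% level; they are not taken from the text and do not apply to the 30% or 70% information fractions. The sketch checks numerically that each pair spends an overall type I error of about 2.5%.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def overall_alpha(c1, c2, grid=4000, lower=-8.0):
    """One-sided type I error of a two-stage design with equal stage sizes.

    The null hypothesis is rejected at stage 1 if Z1 >= c1, otherwise at
    stage 2 if (Z1 + Z2) / sqrt(2) >= c2, with Z1 and Z2 independent
    standard normal. The continuation region is integrated over Z1 with
    the midpoint rule; no futility stopping is considered.
    """
    alpha = 1.0 - STD_NORMAL.cdf(c1)  # probability of early rejection
    step = (c1 - lower) / grid
    for i in range(grid):
        z1 = lower + (i + 0.5) * step
        # P(Z2 >= sqrt(2) * c2 - z1) given the interim value z1
        p_reject_stage2 = 1.0 - STD_NORMAL.cdf(sqrt(2.0) * c2 - z1)
        alpha += STD_NORMAL.pdf(z1) * p_reject_stage2 * step
    return alpha
```

Both `overall_alpha(2.178, 2.178)` and `overall_alpha(2.797, 1.977)` come out close to 0.025, while the first-stage rejection probability under the null hypothesis is several times larger with the Pocock constants, which is what makes early stopping easier with that boundary.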
Both O'Brien and Fleming's and Pocock's boundaries allow early stopping of the study only for efficacy of the tested treatment or, in other words, for the rejection of the null hypothesis at the interim analysis. Quite commonly, however, researchers are willing to stop the trial prematurely when results are not sufficiently promising, that is for futility. Therefore, we allowed for this by exploring the use of a futility threshold represented by a p-value at the interim analysis $p_1 \geq 0.5$.
Finally, we considered both re-estimation methods described above, with capping at one and a half times the sample size fixed a priori (maximum 350 observations per trial arm) and cP set to 0.90.
For each scenario jointly defined by the above factors, we performed $1 \times 10^6$ simulations and calculated the proportion of simulated studies yielding significant results: (1) under the null hypothesis, $H_0$, to evaluate the type I error rates; (2) under the alternative hypothesis, $H_1$, to assess the power of the study; (3) in between the null and alternative hypothesis, $H_0/H_1$, as a way to show the gain achievable with sample size re-estimation. In particular, we assumed in this setting an effect size δ = 0.2, implying a drop in power of the fixed-sample design from 90% to 53%.
Additional simulation outputs were: (1) the futility stopping probability, i.e. the probability of early stopping of the study with acceptance of the null hypothesis; (2) the efficacy stopping probability, i.e. the probability of early stopping of the study with rejection of the null hypothesis; (3) the average sample size.
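The simulation outputs listed above can be reproduced in miniature with the following Python sketch. It is not the authors' SAS code: it assumes a 50% information fraction (117 observations per arm at the interim analysis), equal stage weights, known unit variance, stage-wise z-statistics drawn directly rather than raw data, conditional power re-estimation, and critical values supplied by the user. All names and defaults are illustrative.

```python
import random
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def simulate_two_stage(delta_true, c1, c2, n_sims=100_000, seed=1,
                       n1=117, n_max=350, delta_plan=0.3,
                       c_power=0.90, futility_p=0.5):
    """Monte Carlo operating characteristics of a two-stage design.

    delta_true -- true standardized effect (0 under the null hypothesis)
    c1, c2     -- critical z-values at the interim and final analysis
    n1, n_max  -- per-arm interim size and per-arm cap on the total size
    """
    rng = random.Random(seed)
    z_futility = STD_NORMAL.inv_cdf(1.0 - futility_p)  # p1 >= futility_p
    z_cp = STD_NORMAL.inv_cdf(1.0 - c_power)
    futile = early = rejected = total_n = 0
    for _ in range(n_sims):
        z1 = rng.gauss(delta_true * sqrt(n1 / 2.0), 1.0)
        if z1 >= c1:                      # early stop for efficacy
            early += 1
            rejected += 1
            total_n += n1
            continue
        if z1 <= z_futility:              # early stop for futility
            futile += 1
            total_n += n1
            continue
        # Conditional power re-estimation of the second-stage size
        required = sqrt(2.0) * c2 - z1 - z_cp
        n2 = ceil(2.0 / delta_plan ** 2 * required ** 2) if required > 0 else 1
        n2 = max(1, min(n2, n_max - n1))  # apply the cap
        z2 = rng.gauss(delta_true * sqrt(n2 / 2.0), 1.0)
        if (z1 + z2) / sqrt(2.0) >= c2:   # inverse normal combination
            rejected += 1
        total_n += n1 + n2
    return {
        "FSP": futile / n_sims,
        "ESP": early / n_sims,
        "average_n_per_arm": total_n / n_sims,
        "rejection_rate": rejected / n_sims,
    }
```

Calling `simulate_two_stage(0.0, 2.178, 2.178)` estimates the type I error rate and the futility stopping probability under the null hypothesis for Pocock-type constants, and `simulate_two_stage(0.3, 2.178, 2.178)` estimates power under the alternative. Because of the simplifying assumptions, the figures will not match the published tables exactly.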
Calculations were performed using two macros, written with the statistical software SAS™ (version 9.2) by the authors, which are supplied in the Supplementary Materials.

Results
Simulation results obtained using the effect size ratio and the conditional power re-estimation methods are shown in detail in Tables 1 and 2, respectively.
Under $H_0$, every type of adaptive design considered was able to maintain the type I error rate at the nominal 2.5% level, with no difference across the distinct scenarios used for testing adaptive designs. A striking result, however, was the increase in sample size when the futility stopping rule was not adopted, compared to the size of the fixed-sample study. Such an increase was slightly affected by the chosen information fraction, the type of decision boundary and the method for sample size re-estimation. At worst (70% information fraction, O'Brien and Fleming's boundary, effect size re-estimation method, Table 1) the average increase amounted to 49% (348 subjects per trial arm compared to 233 for the fixed-sample design), which is close to the limit imposed by the capping. Such a drawback was no longer evident when adopting a futility stopping rule. Rather, a gain in efficiency was obtained in this case: at no cost in terms of type I error probability, the sample size was somewhat reduced, provided that the information fraction did not exceed 50%.
Results in terms of power and sample size under $H_1$ and $H_0/H_1$ are plotted in Figures 1 and 2. In these plots, power and sample size of the fixed-sample design are represented by horizontal reference lines. In power plots, the dots above or below the reference line indicate whether power is preserved (or even increased) or not in the specific simulation; similarly, dots in sample size plots above or below the reference line indicate whether sample size is increased (loss of efficiency) or decreased (gain in efficiency) compared to the fixed-sample design.
Under $H_1$, and using the effect size re-estimation method (Figure 1a), a gain in power beyond the nominal level was always achieved. However, contrary to what one normally expects from group sequential trials, sample size was increased when using O'Brien and Fleming's boundary, even in the presence of a futility stopping rule. In contrast, a reduction in sample size was consistently obtained with Pocock's boundary.
The picture was rather different when using the conditional power re-estimation method (Figure 1b). Power was again preserved, at least when adopting an information fraction of 50% or 70%, but sample size was consistently reduced by 7% to 20% compared to that required by the fixed-sample design, with no relevant effect of the type of decision boundary or of whether futility stopping was used.
Under the intermediate hypothesis $H_0/H_1$ and using the effect size ratio re-estimation method (Figure 2a), the adaptive approach always improved study power compared to that obtained with a fixed-sample design; the best outcome was obtained with O'Brien and Fleming's boundary and a 70% information fraction, a condition in which 74% power was achieved. On the other hand, the price to pay to improve study power was an increase in sample size varying between 8% and 32%.
With the conditional power re-estimation method (Figure 2b), study power showed considerable improvement only when using an information fraction of 50% or 70%. Compared with the performance of the effect size ratio re-estimation method, such an improvement was slightly smaller, considering that study power reached a maximum of 67-68% (versus 74%), but the price in terms of sample size increase was also smaller, ranging between 4% and 17%.
Finally, the futility stopping probability (FSP) and the efficacy stopping probability (ESP), obviously not affected by the re-estimation method, were influenced by the information fraction and, when considering ESP, also by the type of decision boundaries. In particular, FSP (when allowed for by design) tended to decrease with increasing information fraction under $H_1$ and $H_0/H_1$. ESP, while stable under $H_0$, tended to increase with increasing information fraction under $H_1$ and $H_0/H_1$; furthermore, under otherwise similar conditions, ESP was always larger with Pocock's than with O'Brien and Fleming's boundaries, consistent with the conservative nature of the latter.
Figure 1. Power and average sample size under the alternative hypothesis ($H_1$) according to simulation number (see Tables 1-2 for identifying associated scenarios) using the effect size ratio re-estimation method (Figure 1a) or the conditional power method (Figure 1b).

Discussion
The possibility of adapting some of the characteristics of a clinical trial, such as sample size, based on the results of one or more interim analyses is definitely a great opportunity. It must be underlined, however, that this kind of adaptation is not always applicable or ideal: for example, it is more easily manageable when the response to treatment can be measured and recorded in a short period of time. Furthermore, any type of interim analysis adds complexity to the logistics management of the trial.
These problems aside, from the statistical viewpoint it is important that changes made on the basis of interim results do not adversely affect the operating characteristics of the study. Basically, it is fundamental that the type I error rate is maintained as close as possible to the pre-established level under the null hypothesis, and that study power is preserved under a reasonably wide range of conditions not covered by the null hypothesis, with the least possible price to pay in terms of sample size.
Results of our simulations are very instructive in showing that a gain in power is generally obtained with re-estimation adaptive designs, but that at the same time not all choices among the various options available are equally able to ensure good statistical properties of the trial. Therefore, when designing a study which allows for sample size re-estimation, it is of utmost importance to carefully check the implications of the options specifically chosen. This task is typically performed by simulation, using for instance the macros supplied in the Supplementary Materials or one of the commercially available ad hoc packages. There are, however, some guidelines for applied statisticians that may be drawn from our investigation.
First of all, one very obvious and unwanted drawback of re-estimation adaptive designs is the remarkable increase in sample size under the null hypothesis. This disadvantage, however, can be totally prevented by incorporating an early stopping rule for futility, whose use should thus be considered mandatory in this setting.
The choice of information fraction is also crucial. Only if it is sufficiently high (50% or 70% in our simulations) does study power show considerable improvement when intermediate results are less favourable than anticipated. Theoretically, the nominal power might be fully achieved with sample size re-estimation. However, since a clear upfront commitment in terms of sample size is typically required by both sponsors and ethical committees, a limit on the maximum sample size is commonly adopted, which in turn limits the maximum power recovery. Again, the "capping threshold" (1.5 in our simulations) is one of the study design parameters that need to be assessed in the planning phase under reasonable scenarios.
Regarding re-estimation methods, we showed more favourable properties for the conditional power method.Such a procedure, compared with the effect size re-estimation method, is more efficient (sample size is steadily diminished) under the alternative hypothesis, as well as in conditions intermediate between the null and alternative hypotheses.
Finally, when considering the joint use of futility stopping, a 50-70% information fraction and the conditional power re-estimation method, only trivial differences emerged in the statistical properties of O'Brien and Fleming's or Pocock's decision boundaries. It must be considered, however, that researchers are usually unwilling to stop the trial too early because of favourable results, which is a justification for preferring the more conservative O'Brien and Fleming's boundaries.
As mentioned in the Introduction, there is controversy about the actual benefits implied by the re-estimation design in comparison with studies incorporating conventional interim analyses. In such a comparison, the most critical aspect from the statistical viewpoint is the loss of statistical efficiency resulting from the weighting scheme used to combine the results of each stage of an adaptive study according to the inverse normal method, as well explained by Fleming [10]. This author supports his argument with an example in which the first- and second-stage sample sizes, 200 and 1100 respectively, are greatly imbalanced. The imbalance, however, is unlikely to be so relevant in most practical conditions. The consequent loss of efficiency might thus be trivial and easily offset by the advantage of a smaller up-front investment of sample size resources, followed by a larger subsequent investment contingent on seeing promising results from the interim analysis, whenever a futility stopping rule (as done in our simulations) or conditional power calculation is incorporated into study monitoring.
In conclusion, although the presented simulations are limited to a restricted set of possible scenarios, our findings are in agreement with theory and may be regarded as sufficiently generalizable to other settings. In any case, as underlined before, applied statisticians wishing to adopt the type of adaptive designs discussed here must carefully check their properties in the specific situations through simulation of a number of possible scenarios. In this case, as we have shown with our investigation, re-estimation adaptive designs may actually improve the statistical quality of clinical trials.
Original Articles. Epidemiology Biostatistics and Public Health - 2014, Volume 11, Number 1

Table 1. Two-stage adaptive design with the effect size ratio re-estimation method*
*Futility stopping probability (FSP), efficacy stopping probability (ESP), average sample size and probability of rejecting the null hypothesis by varying information fraction, decision boundaries and futility stopping rule.