Allocating the Sample Size in Phase II and III Trials to Optimize Success Probability

{\bf Background} Phase II and III clinical trials often fail due to poor experimental planning. Here, the problem of allocating available resources, in terms of sample size, to phase II and phase III is studied with the aim of increasing the success rate. The overall success probability (OSP) is accounted for. {\bf Methods} Focus is placed on the amount of resources that should be provided to phase II and III trials to attain a good level of OSP, and on how many of these resources should be allocated to phase II to optimize OSP. It is assumed that phase II data are not considered for confirmatory purposes and that they are used for planning phase III through sample size estimation. With $r$ denoting the rate of resources allocated to phase II, $OSP(r)$ is a concave function and there exists an optimal allocation $r_{opt}$ giving $\max\{OSP\}$. If $M_I$ is the sample size giving the desired power to phase III, and $kM_I$ is the whole sample size that can be allocated to the two phases, it is indicated how large $k$ and $r$ should be in order to achieve levels of OSP of practical interest. {\bf Results} For example, when 5 doses are evaluated in phase II and 2 parallel phase III confirmatory trials (one-tail type I error $=2.5\%$, power $=90\%$) are considered with 2 groups each, $k=24$ is needed to obtain $OSP\simeq 75\%$, with $r_{opt}\simeq 50\%$. The choice of $k$ depends mainly on how many phase II treatment groups are considered, not on the effect size of the selected dose. When $k$ is large enough, $r_{opt}$ is close to $50\%$. An $r\simeq25\%$, although not best, might give a good OSP and an invitingly small total sample size, provided that $k$ is large enough. {\bf Conclusions} To improve the success rate of phase II and phase III trials, drug development could be looked at in its entirety. Resources larger than those usually employed should be allocated to phase II to increase OSP. The phase II allocation rate may be increased to at least 25\%, provided that a sufficient global amount of resources is available.


Introduction
It is common knowledge that the aim of phase II clinical trials is mainly exploratory, while that of phase III is confirmatory, and that phase II also serves to enhance planning for the subsequent phase III. Usually, phase II is small with respect to phase III, and the rate of trial failures, which is around 60% and 40% for phase II and III respectively, suggests that this habit might not be helpful. In general, low success probabilities are often due to low sample size [1]. Here, we study sample size problems from the perspective of a drug development project, which means considering phase II and phase III sample sizes jointly.
To introduce the problem, by way of example, let us suppose that a phase II trial has been run, with 2 parallel arms of 60 patients each, and that a phase III with the same design needs to be planned. Assume that the efficacy value (standardized effect size) of minimum interest that should be observed in order to launch phase III is 0.15 and that a value slightly higher than this has been observed in phase II, so that phase III has to be launched. With one-sided $\alpha$ = 2.5% and power $1-\beta$ = 90%, approximately 940 patients should be recruited for each group if the observed effect size is adopted for sample size computation, namely the pointwise strategy [2,3]. This number (940) is quite high, but not beyond the range of those usually adopted in phase III trials (visit clinicaltrials.gov, a service of the U.S. NIH). So, assuming that the research team decided to actually launch phase III, the total number of patients enrolled is about 2,000. Now, the point is: if the resources for studying 2,000 patients were actually available, would there be an allocation of sample size better than 60/940? Would, for example, 400 data allocated to phase II and, at most, 600 to phase III have been a better choice? It is worth noting that we wrote "at most" because when 400 data per group come from phase II and are used for estimating the phase III sample size, the latter is not necessarily 600: it is almost surely lower. Moreover, what does "better allocation" mean? And is there an optimal allocation?
Besides dose selection and safety evaluation, the aims of phase II are to correctly decide go/no-go, that is, to launch phase III (i.e. go) with a high probability if a meaningful effect really exists, and to estimate well the drug effect size to indicate a phase III sample size (M) as close as possible to the ideal one (i.e. the one providing the desired probability of success of phase III); the aim of phase III is to prove efficacy with a high probability, once again if a meaningful effect really exists. Hence, the aim is to succeed with high probability in both phases, whenever the drug under study actually works well.
In this paper we study sample size resource allocation in terms of overall probability of success (OSP): we focus on the amount of resources that should be provided to phase II and III trials so as to attain a good level of OSP, and on how many of these resources should be allocated to phase II to optimize OSP. Since not all the resources allocated to phase III are spent, depending on sample size estimation, we also focus on the actual amount of resources used. It is assumed that phase II data provide information for phase III planning and are not used for phase III confirmatory analysis.
Analogous computations on success probability have recently been proposed by Jiang [4] under the Bayesian framework. Here, the frequentist approach is adopted, owing to the poor performance of Bayesian sample size estimators (proposed, for example, by Chuang-Stein [5]), whose results show high variability [3].

Theoretical framework

Drug development model
It is assumed that a certain disease is under study, and that $h$ doses of a new drug for the disease of interest are evaluated in a phase II trial ($h$ often varies from 3 to 7). A placebo arm is also run. A classical parallel design is applied in the exploratory phase II, with $h+1$ groups. If phase II results are promising, a single dose $D$ is chosen and 2 phase III trials comparing it to placebo are run, once again under a parallel design. It is also assumed that the three trials (1 phase II and 2 phase III) share the same response variable and the same patient population, meaning that the effect size of the selected dose is the same in both phases. These assumptions allow simple sample size estimation, with no need for further adjustments such as those in [6], and are similar to the assumptions in Jiang [4], where $h = 1$ and only one phase III trial were considered. Here, all trials are run under balanced sampling.
A certain limited amount of resources is available to develop phase II and III trials, and this translates into a total of at most $w$ patients. Let $r \in (0,1)$ be the rate of $w$ allocated to phase II: if a sample of size $n$ is studied for each treatment in phase II, then $n$ is, approximately, $rw/(h+1)$. Consequently, the whole sample size available for phase III is $w(1-r)$, which (almost surely) is not used entirely. Indeed, the phase III sample size actually adopted for each group ($M_n$) is estimated on the basis of phase II data and is a random variable whose maximum is $w(1-r)/4$ (there are 4 phase III groups).
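As a minimal sketch of the bookkeeping just described, the following Python snippet splits a total budget $w$ between the two phases. The function name `allocate`, the rounding of $n$ to the nearest integer, and the illustrative values ($w = 25 \times 85$, $h = 5$, $r$ = 50%, borrowed from one of the settings considered later in the paper) are assumptions made here for illustration only.

```python
def allocate(w, r, h):
    """Split a total budget of w patients between phase II and phase III.

    w: total number of patients available for both phases
    r: rate of w allocated to phase II, 0 < r < 1
    h: number of doses in phase II (h + 1 groups, placebo included)
    Returns (n, m_cap): patients per phase II group and the cap on the
    per-group phase III sample size (4 phase III groups in total).
    """
    if not 0 < r < 1:
        raise ValueError("r must lie strictly between 0 and 1")
    n = round(r * w / (h + 1))   # assumption: round to the nearest integer
    m_cap = w * (1 - r) / 4      # two trials x two groups share w(1 - r)
    return n, m_cap


n, m_cap = allocate(w=25 * 85, r=0.50, h=5)
print(n, m_cap)
```

The cap `m_cap` is only an upper bound: the phase III sample size actually used depends on the phase II estimate and is typically smaller.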

Phase II tools
Let $\delta = (\mu_D - \mu_P)/\sigma$ be the generic standardized effect size, where $\mu_D$ and $\mu_P$ are the means of the response variable in the populations under $D$ and under placebo. Without loss of generality, $\sigma = 1$. The true, unknown effect size is $\delta_t$. $\bar X_{D,n}$ and $\bar X_{P,n}$ are the sample means of the measurements, and $d_n = \bar X_{D,n} - \bar X_{P,n}$ is the pointwise estimator of $\delta_t$. Call $L$ the random event representing the success of phase II ($L$ stands for phase III launch). $L$ can be defined in different ways: for example, on the basis of the maximum sample size $m_{max}$ for phase III (i.e. $L \Leftrightarrow M_n \le m_{max}$), or on the basis of the observed effect size exceeding a threshold of clinical relevance (i.e. $L \Leftrightarrow d_n > \delta_{0L}$). Kirby et al. [7] evaluated some further launching rules, which, through simple algebra, can be reduced to $L \Leftrightarrow d_n > \delta_{0L}$, for some values of $\delta_{0L}$. Note that the first two launching criteria above can be set so as to be mathematically equivalent ([1], Ch. 3). Although the launching rule based on $\delta_{0L}$ is the most intuitive, and one of the most used, let us adopt the one pragmatically based on $m_{max}$. In this framework, constrained by $w$ and modeled by $r$, the actual launching rule becomes $L \Leftrightarrow M_n \le m_{max}(r)$, where $m_{max}(r)$ denotes the maximum phase III sample size per group under allocation $r$, which cannot exceed $w(1-r)/4$.
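The equivalence between a launching rule based on a maximum sample size and one based on an effect size threshold can be illustrated numerically. The sketch below assumes the usual normal-approximation sample size formula, $M_n = \lfloor 2(z_{1-\alpha}+z_{1-\beta})^2/d_n^2 \rfloor + 1$ (an assumption here, chosen because it agrees with the sample sizes quoted in the paper); under it, for an integer cap, $M_n \le m_{max}$ holds exactly when $d_n > (z_{1-\alpha}+z_{1-\beta})\sqrt{2/m_{max}}$. The function name `launch_threshold` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def launch_threshold(m_max, alpha=0.025, power=0.90):
    """Effect-size threshold equivalent to the rule M_n <= m_max.

    Assumes M_n = floor(2 * (z_{1-alpha} + z_{1-beta})**2 / d_n**2) + 1
    and an integer m_max, so that
    M_n <= m_max  <=>  d_n > (z_{1-alpha} + z_{1-beta}) * sqrt(2 / m_max).
    """
    std = NormalDist()
    z_sum = std.inv_cdf(1 - alpha) + std.inv_cdf(power)
    return z_sum * sqrt(2 / m_max)


# A cap of 940 patients per group corresponds to a threshold near 0.15,
# the pair of values used in the introductory example.
print(round(launch_threshold(940), 4))
```

Lowering the cap raises the threshold, which is why a tight resource constraint makes the launching rule stricter.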

Phase III tools
The Z-test is adopted with one-sided alternatives. With $m$ denoting the generic sample size per group,
$$T_m = \sqrt{m/2}\,(\bar X_{D,m} - \bar X_{P,m})$$
is the test statistic and the success probability is, according to [1],
$$SP(m) = P(T_m > z_{1-\alpha}) = \Phi\left(\delta_t\sqrt{m/2} - z_{1-\alpha}\right),$$
where $\Phi$ is the standard normal distribution function and $z_p$ its $p$-quantile. $1-\beta$ is the desired power to be achieved in each phase III trial (e.g. 90%); then, the ideal sample size per group for each phase III trial is:
$$M_I = \left\lfloor 2(z_{1-\alpha}+z_{1-\beta})^2/\delta_t^2 \right\rfloor + 1.$$
Once phase II has succeeded, phase III is run with the sample size estimated from the $2n$ phase II data. Several sample size estimation strategies can be adopted [1].
Here, $M_I$ is estimated by the pointwise estimator based on the observed effect size:
$$M_n = \left\lfloor 2(z_{1-\alpha}+z_{1-\beta})^2/d_n^2 \right\rfloor + 1.$$
The pointwise strategy is adopted for simplicity and also because its performance in terms of OSP and MSE is acceptable, although not the best [3]. Two confirmatory phase III trials are run simultaneously and independently, each group with $M_n$ patients, so that the success probability is the random variable $(SP(M_n))^2$. The mean of $(SP(M_n))^2$, conditional on $L$, is of main interest and, although it has been called Average Power by Wang et al. [2], we call it the SP of phase III:
$$SP_{III}(M_n) = E\left[(SP(M_n))^2 \mid L\right].$$
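The phase III quantities of this section can be sketched in a few lines of Python. The formulas coded here are the standard normal-approximation ones for a one-sided two-sample Z-test with unit variance; they are assumed rather than taken verbatim from the paper, although they agree with the ideal sample sizes 526, 85 and 33 reported for effect sizes 0.2, 0.5 and 0.8. All function names are hypothetical.

```python
from math import floor, sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def sample_size(effect, alpha=0.025, power=0.90):
    """Per-group sample size for a one-sided Z-test (sigma = 1).

    Called with the true effect size it gives the ideal size M_I;
    called with the observed d_n it gives the pointwise estimate M_n.
    """
    z_sum = STD.inv_cdf(1 - alpha) + STD.inv_cdf(power)
    return floor(2 * z_sum ** 2 / effect ** 2) + 1


def success_prob(m, delta_t, alpha=0.025):
    """Power of one phase III trial with m patients per group."""
    return STD.cdf(delta_t * sqrt(m / 2) - STD.inv_cdf(1 - alpha))


def two_trial_success_prob(m, delta_t, alpha=0.025):
    """Probability that two independent phase III trials both succeed."""
    return success_prob(m, delta_t, alpha) ** 2


for delta_t in (0.2, 0.5, 0.8):
    m_ideal = sample_size(delta_t)
    print(delta_t, m_ideal, round(two_trial_success_prob(m_ideal, delta_t), 3))
```

With the ideal sample size each trial has power just above 90%, so the two-trial success probability is slightly above 81%.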

Defining OSP
Let us assume that the quantity to be optimized is the Overall Success Probability (OSP), that is the joint probability of success of phase II and phase III (in the recent past, OSP has been called Overall Power [8,3,1]). Since the results of the two phases are independent (it is assumed that phase II data are not included in the analysis of phase III data), OSP is given by the product of the success probabilities of phases II and III:
$$OSP(r) = SP_{II}(r) \times SP_{III}(M_n),$$
where $SP_{II}(r) = P(L)$ is the launch probability. We expect $OSP(r)$ to be low for small values of $r$, due to a low launch probability (i.e. $SP_{II}(r)$). Also, $OSP(r)$ is expected to be low for high values of $r$, due to low values of $SP_{III}(M_n)$, since $M_n$ is limited by a low value $m_{max}(r)$. Then, there is an intermediate allocation of resources that optimizes $OSP(r)$, that is
$$r_{opt} = \arg\max_{r \in (0,1)} OSP(r).$$

Behavior of OSP

Settings
It has recently been shown [3,7] that the threshold of clinical relevance $\delta_{0L}$ should be set not too close to $\delta_t$, in order not to penalize $SP_{II}$. A threshold around $\delta_t/3$ is therefore set. Phase III type I error is 2.5%, whereas the power is 90% [2,3]. Three effect size values are considered ($\delta_t$ = 0.2, 0.5, 0.8), giving ideal phase III sample sizes $M_I$ of 526, 85 and 33, respectively. For each $\delta_t$, the whole sample size $w$ is taken equal to $kM_I$, with $k$ = 10, 15, 20, 25, 30. Three numbers of doses $h$ are accounted for: 3, 5, 7. For each of the 45 settings (3 values of $\delta_t$ × 5 values of $k$ × 3 values of $h$), $r$ is considered varying from 5% to 95%.
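For one of these settings ($\delta_t = 0.5$, $k = 20$, $h = 5$), the sweep over $r$ can be sketched as follows. This is an illustrative reconstruction, not the author's code: it assumes that $d_n$ is normal with mean $\delta_t$ and variance $2/n$, that the pointwise sample size formula is $\lfloor 2(z_{1-\alpha}+z_{1-\beta})^2/d_n^2 \rfloor + 1$, that $n$ is rounded to the nearest integer, and that phase III is launched when the estimated size fits the resources left. OSP is approximated by midpoint integration; all names are hypothetical, and the output is not guaranteed to reproduce the published figures exactly.

```python
from math import floor, sqrt
from statistics import NormalDist

STD = NormalDist()
ALPHA, POWER = 0.025, 0.90
Z_ALPHA = STD.inv_cdf(1 - ALPHA)
Z_SUM = Z_ALPHA + STD.inv_cdf(POWER)


def sample_size(effect):
    """Per-group sample size from the (assumed) normal-approximation formula."""
    return floor(2 * Z_SUM ** 2 / effect ** 2) + 1


def osp(r, delta_t, k, h, grid=2000):
    """Approximate OSP(r) by midpoint integration over the observed effect d_n."""
    w = k * sample_size(delta_t)              # total budget, w = k * M_I
    n = max(1, round(r * w / (h + 1)))        # patients per phase II group
    m_max = (w - n * (h + 1)) // 4            # cap per phase III group
    if m_max < 1:
        return 0.0
    d_n = NormalDist(delta_t, sqrt(2 / n))    # sampling distribution of d_n
    d_low = Z_SUM * sqrt(2 / m_max)           # launch  <=>  d_n > d_low
    d_high = max(d_low, delta_t + 8 * d_n.stdev)
    step = (d_high - d_low) / grid
    total = 0.0
    for i in range(grid):
        d = d_low + (i + 0.5) * step
        m = min(sample_size(d), m_max)        # estimated phase III size
        power_one_trial = STD.cdf(delta_t * sqrt(m / 2) - Z_ALPHA)
        total += power_one_trial ** 2 * d_n.pdf(d) * step
    return total


rates = [i / 100 for i in range(5, 96)]       # r from 5% to 95%
curve = [osp(r, delta_t=0.5, k=20, h=5) for r in rates]
best_osp = max(curve)
r_opt = rates[curve.index(best_osp)]
print(f"max OSP = {best_osp:.3f} at r = {r_opt:.2f}")
```

Integrating only over the launch region computes the product of launch probability and conditional phase III success probability in one pass, which avoids dividing by a possibly tiny launch probability.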

Simplifying launching rule
Let us translate the launch threshold based on the effect size into that based on the maximum sample size, and then simplify the OSP formulas. Let $\delta_{0L}(r)$ be the effect size threshold implied by the maximum sample size allowed by the available resources. In all the settings, $\delta_{0L}(r)$ emerges higher than $\delta_t/3$: the launching rules based on the constraint on sample size given by the available resources are stricter than $\delta_t/3$ (see Figure ??, where $\delta_{0L}(r)$ is reported as a function of $r$; varying $\delta_t$ and $h$, the curves are very similar). Note that this stricter launching rule, which penalizes the probability of launching phase III (i.e. $SP_{II}$), is imposed by the model we are studying.
With $k = 15$, $\max\{OSP\} \simeq 63\%$. With $k = 20$, $\max\{OSP\} \simeq 72\%$; the optimal rates $r_{opt}$ under different values of $\delta_t$ are very close (i.e. $r_{opt} \simeq 47\%$). For $k$ = 25, 30, we have $\max\{OSP\} \simeq$ 76%, 78%, with $r_{opt} \simeq$ 52%, 60%, respectively. For $k \ge 20$, $OSP(r)$ is quite flat around its maximum, meaning that even if the rate $r$ allocated to phase II is a bit smaller than $r_{opt}$, $OSP(r)$ still provides acceptable values. For example, $OSP(30\%) \simeq$ 70%, 74%, 76%, with $k$ = 20, 25, 30, respectively. Finally, $r_{opt}$ moves from 34% to 60% with $k$ increasing from 10 to 30: when $w$ increases, the best solution is to allocate more and more sample size to phase II, to improve both $SP_{II}$ and the precision in estimating $M_I$. In [...] (even higher if $h = 7$) is suggested to reach suitable OSP values.

Sizing the whole amount of resources
Since the desired $SP_{III}$ is $(90\%)^2 = 81\%$, and a good $SP_{II}$ level may still be around 90% (recall that the aim is to study how to increase the success rates in clinical trials), we consider an OSP around 72% as acceptable.
As a rule of thumb, in order to obtain an OSP $\simeq$ 75% with a number of phase II groups ranging from 2 to 10 (and 2 phase III confirmatory trials), provide the whole development project with sufficient resources to recruit a number of patients from 15 to 30 times (increasing linearly with $h$) the ideal sample size $M_I$, and allocate about 50% of the sample size to phase II, regardless of the magnitude of $\delta_t$.
Moreover, if just 20% of resources is allocated to phase II, OSP remains near 70%, provided that the resources needed to reach OSP = 75% are stored before starting phase II.
We remark that not all the stored resources are used: $rw$ is actually spent in phase II, whereas the global phase III sample size $4M_n$ is at most $(1-r)w$ (see next Section).
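The linear rule of thumb can be written as a one-line interpolation. The sketch below is one possible reading of the rule, assuming that $k$ grows linearly from 15 (2 phase II groups) to 30 (10 phase II groups) and that the number of groups is $h + 1$; this reading is consistent with the value 22.5 used for $h = 5$ in the assurance example of the next section. The function name is hypothetical.

```python
def resources_multiplier(h):
    """Rule-of-thumb k (total budget in units of M_I) for OSP near 75%.

    Linear interpolation between k = 15 for 2 phase II groups and
    k = 30 for 10 phase II groups; h is the number of doses, so the
    number of phase II groups is h + 1 (placebo included).
    """
    groups = h + 1
    if not 2 <= groups <= 10:
        raise ValueError("rule of thumb stated only for 2 to 10 phase II groups")
    return 15 + (30 - 15) * (groups - 2) / (10 - 2)


for h in (3, 5, 7):
    k = resources_multiplier(h)
    print(f"h = {h}: store about {k} x M_I, with roughly half of it for phase II")
```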

Assuring the whole amount of resources
The fact that in practice $\delta_t$ is unknown does not influence the allocation choice based on OSP, since $OSP(r)$ is almost independent of $\delta_t$. Nevertheless, the amount of resources needed to obtain a given OSP level does depend on $\delta_t$. In particular, since we adopted $M_I$ as the unit of measure for $w$, the resources needed depend on $\delta_t$ through $M_I$.
In practice, the unknown $M_I$ should be replaced by $M_a$, the sample size obtained from the formula for $M_I$ with $\delta_a$ in place of $\delta_t$, where $\delta_a$ is the assumed effect size. However, how close $M_a$ is to $M_I$ is unknown. To reinforce the assumption on $\delta_a$ and limit the uncertainty on the parameter, assurance can be applied [9]. This consists in defining a distribution around $\delta_a$ (viz. $f_{\delta_a}(t)$) so that the assured sample size becomes $M_A = \int M(t)\, f_{\delta_a}(t)\, dt$, where $M(t)$ is the sample size computed under effect size $t$; it can be viewed as Bayesian sample size determination, where $f_{\delta_a}(t)$ plays the role of the prior distribution.
For example, when the uniform prior $f_{\delta_a}(t) = 1/\delta_a$, $t \in (\delta_a/2, 3\delta_a/2)$, is adopted, we obtain $M_A \simeq (4/3) M_a$. The linear rule of thumb above, through assurance, suggests providing the whole development project, when $h = 5$, with sufficient resources to recruit $22.5M_A$ patients, i.e. 22.5 × 4/3 = 30 times the assumed sample size $M_a$. A lower assurance provides $22.5 \le k \le 30$.
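The factor 4/3 can be checked directly. Ignoring the integer rounding in the sample size formula, so that the sample size under effect size $t$ is taken proportional to $1/t^2$ (an approximation assumed here), and writing $c$ for the constant $2(z_{1-\alpha}+z_{1-\beta})^2$, the uniform density on $(\delta_a/2, 3\delta_a/2)$ equals $1/\delta_a$, since the interval has length $\delta_a$, and gives:

```latex
\begin{align*}
M_A &\simeq \int_{\delta_a/2}^{3\delta_a/2} \frac{c}{t^2}\,\frac{1}{\delta_a}\,dt
     = \frac{c}{\delta_a}\left[-\frac{1}{t}\right]_{\delta_a/2}^{3\delta_a/2} \\
    &= \frac{c}{\delta_a}\left(\frac{2}{\delta_a}-\frac{2}{3\delta_a}\right)
     = \frac{4}{3}\,\frac{c}{\delta_a^2} \simeq \frac{4}{3}\,M_a .
\end{align*}
```

The inflation comes from the convexity of $1/t^2$: effect sizes below $\delta_a$ raise the required sample size more than effect sizes above $\delta_a$ lower it.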

Mean and variability of total sample size
An indispensable aspect of this sample size allocation problem is to evaluate the actual amount of resources spent in phase III, as well as those spent overall, depending on the behavior of the sample size estimator $M_n$.
In Table 2, the average and the MSE are shown with $k$ = 15, 20, 25, 30, $h = 5$, and with $r$ = 25%, 50%, 75%. When $k$ increases and $r$ is fixed, both the mean and the MSE of $M_n|L$ increase. Mainly, the estimation process becomes more reliable when $r$ increases: the mean of $M_n|L$ tends to $M_I$ and the MSE decreases. Moreover, when $k = 25$ and $r$ = 50% (viz. operating conditions giving high OSP when $h = 5$), the mean of $M_n$ is close to $M_I$ and the mean error is about $M_I/2$, for every $\delta_t$. Indeed, the behavior of $M_n$ is almost independent of $\delta_t$, in accordance with that of OSP.
Now, let us consider how these numbers reflect on the whole amount of resources spent in both phases, viz. on the total sample size $M_T = M_I \times k \times r + 4M_n$. From a practical standpoint, the settings with $k$ = 20, 25 and $r$ = 25%, 50% are the most interesting: $k = 30$ provides OSP higher than requested, and with $k = 15$ the OSP is often low; also, OSP is low with $r$ = 75%, due to strict constraints for $M_n$. When $k = 25$ and $r$ = 50%, and with $\delta_t = 0.5$ giving $M_I = 85$, the average amount of resources spent is $E(M_T) = 85 \times 25 \times 50\% + 4 \times 94.0 = 1438.5 \simeq 17M_I$, with a standard deviation of $\sigma(M_T) \simeq 2M_I$; recall, this is almost independent of $\delta_t$. Percentiles for $M_n$, and so for $M_T$, can be obtained through conditional probability calculation: for example, with $\delta_t = 0.5$ and under the latter setting (i.e. $n = 25 \times 85 \times 50\%/(5+1) \simeq 177$), the 80% and 90% percentiles are $m^{177}_{.8} = 122$ and $m^{177}_{.9} = 151$. Once again, percentiles present small variations as a function of $\delta_t$. Mean, standard deviation and percentiles of $M_T$ for the four settings considered of main interest are reported in Table 3.
In the light of these further results, even allocations $r$ that do not provide optimal OSP may be of practical interest. For example, when $k = 25$ is adopted (i.e., $w = 25M_I$ is stored) and $r$ = 25% of resources are allocated to phase II, the average of $M_T$ is $11.1M_I$ and $M_T$ does not overcome [...]

Discussion
Although the development of a drug, and in particular the clinical part regarding phase II and III trials, might be looked at in its entirety, scientists and trial managers often tend to focus on each phase separately. In particular, resources to develop the research project are often funded for each phase separately. It is a fact that the failure rate of phase II and phase III clinical trials is quite high.
Here, the assumption is that the whole amount of resources to develop phase II and III trials (in terms of sample size) is stored, and therefore potentially available, before starting phase II. We studied the problem of allocating the resources to the two phases; to be precise, resources allocated to phase II are all used, whereas those used in phase III are at most those left, depending on phase III sample size estimation based on phase II data.
It was assumed that 2 phase III trials are run with a sample size estimated on the basis of phase II data. The overall success probability (OSP) has been evaluated as a tool for planning experiments, in accordance with some recent papers [8,4,3], and the variability of the resources actually spent has been accounted for.
We showed that to obtain a sufficiently high OSP (e.g. 75%) when the number of doses evaluated in phase II goes from 3 to 9, the whole amount of resources needed varies (linearly) from 19 to 31 times $M_I$. This is almost regardless of the effect size of the dose selected in phase II. Moreover, to obtain the optimal OSP, the rate of resources to be allocated to phase II is often close to 50%. Even an amount of resources of 25% might give a good OSP and an invitingly small total sample size if allocated to phase II, provided that a sufficient amount of resources is stored for the two phases. If the whole amount of resources available for the two phases is low, the OSP will be low too, even lower than 50%, even if the best allocation of resources is made. Since $M_I$ depends on the unknown effect size of the selected dose, wrong assumptions regarding the latter can cause too small investments and low OSP. To reduce this risk, $M_I$ may be computed by applying assurance [9] on effect size assumptions.
The observed phase II effect size was adopted to compute the phase III sample size: being aware of the variability in effect size estimation, conservative sample size estimation strategies may be adopted, as in [1]. The OSP can, therefore, increase considerably (i.e. by about 3% when OSP $\simeq$ 75%; unpublished result).
Allocations near 50% providing the optimal OSP are usually not adopted in clinical practice: phase II often absorbs fewer resources than phase III. Indeed, the size of samples adopted in phase II is, on average, 10-15% of the total sample size of the two phases [1]. To improve the success rate of phase II and phase III trials, drug development could be looked at in its entirety, and phase II allocation might be increased to at least 25%, provided that a sufficient global amount of resources is available. Then, a more accurate phase II would also induce a higher probability of choosing the best dose among those considered. Nevertheless, larger phase II trials imply higher costs and longer times for the development project, leaving a shorter patent life and so lower potential profits, of course in case of successful trials. Allocation of resources should also be evaluated from an economic perspective, as also suggested by Jiang [4]. For this reason, our future work may focus on the relationship among allocations, OSP, efficacy and safety utility functions, costs, revenues, and profits, according to [10,11]. The indications on the amount of resources to be allocated to phase II suggested by Jiang [4] differ from ours, but in that paper only 2 phase II groups and 1 phase III trial are taken into account. Differences between our indications and those provided by Stallard [12] are much more evident, since there phase II data are considered only for detecting a certain effect with low power, not for adequately planning phase III.
Epidemiology Biostatistics and Public Health - 2014, Volume 11, Number 4 (e9958-15)

the drug effect size to indicate a phase III sample size (M) as close as possible to the ideal one (i.e. the one

where $h = 1$ and only one phase III trial were considered. Here, all trials are run under balanced sampling.

Computing OSP
OSP functions, with $h = 5$, are reported in Figure ??: the functions $OSP(r)$ under different values of $\delta_t$ are very similar, as they lie approximately on the same curves. Differences among the $OSP(r)$ obtained with different values of $k$ are evident: the values of $OSP(r)$ increase when $k$ increases. OSP levels that look acceptable are obtained when $k$ is at least 20 (see Figure ??). When $k = 10$, we have $\max\{OSP\} = OSP(r_{opt}) \simeq 45\%$.

Table 2

resources spent is $E(M_T) = 85 \times 25 \times 50\% + 4 \times 94.0 = 1438.5 \simeq 17M_I$, with a standard deviation of $\sigma(M_T) \simeq 2M_I$; recall, this is almost independent of $\delta_t$. Percentiles for

1 9$M_I$.

For all 45 settings, the maximum sample size introduced by the constraint of available resources, i.e. $w(1-r)/4$, results lower than $9M_I$. Consequently, $m_{max}(r)$ turns out to be $w(1-r)/4$. The OSP in equations (??) is computed by replacing $m_{max}(r)$ with $w(1-r)/4$. The actual launching threshold for the effect size becomes $\delta_{0L}$

Table 1. Minimum values of $k$ needed to reach a maximum OSP of at least 70%, 75% (viz. $r_{70\%} = \min\{r \text{ s.t. } OSP(r) \ge 70\%\}$).

Table 3. Standardized mean, standard deviation and percentiles of the total expenses in terms of sample size (viz. $M_T$), through percentiles of $M_n$, obtained with $\alpha = 0.025$, $1-\beta = 0.9$, $h = 5$, $k$ = 20, 25 and $r$ = 25%, 50%; $\delta_t = 0.5$ has been adopted.

Let us continue the introductory example, where the sample size of each phase II group was $n = 60$. Assume here that $h = 7$ doses are studied in phase II, and that two phase III trials are launched if $d_{60} > \delta_{0L} = 0.15$. Setting $\alpha$ = 2.5% and $1-\beta$ = 90%, if $\delta_t = 0.4$ then $M_I = 132$; also, $m_{max} = 940$. Note that if $\delta_t$ was 0.4, to obtain an observed value of $d_{60}$ near 0.15, and so a phase III sample size estimate close to 940, [...]. Considering $w$ = 20 times 132 (i.e. $k = 20$ times the ideal phase III sample size), the maximum of OSP is just 67.8% ($r$ = 47%). Now, assume that $w$ is increased up to $k = 25$, in accordance with Table 1, so that resources for treating a total of 25 × 132 = 3300 patients are available for the allocation into the two phases. Then, things go better: with $r$ from 29% to 68% the OSP is higher than 70%. In detail, $\max\{OSP\}$ = 73.5% with $r$ = 51%, where the SP of phase III (also called Average Power) is 76.4% and the launch probability is 96.2%. This best $r$ = 51% gives $n = 211$ (i.e. 1688 patients to be enrolled in phase II) and a maximum phase III sample size, per group, of 403. Actual values of the phase III sample size are often lower than 403: the average of $M_{211}$ is 146.85, and its standard deviation is 69.74. Consequently, the average and the standard deviation of the total sample size $M_T$ are 2276 and 279. This corresponds to about $17.2M_I$ and $2.1M_I$, respectively, meaning that not all the $w = 25M_I$ resources would be spent.
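The arithmetic behind this example can be retraced with a short Python sketch. All inputs are the figures quoted in the text; the variable names are hypothetical, and the only modelling step is the relation $M_T = 8n + 4M_n$, with the phase II total treated as fixed. The mean obtained this way is about 2275.4, in line with the reported 2276 up to rounding.

```python
# Figures quoted in the worked example (delta_t = 0.4, h = 7, k = 25, r = 51%).
M_I = 132                 # ideal phase III sample size per group
w = 25 * M_I              # total budget: 3300 patients
n = 211                   # patients per phase II group
groups_phase2 = 7 + 1     # 7 doses plus placebo
mean_Mn, sd_Mn = 146.85, 69.74   # reported mean and st. dev. of M_211

phase2_total = n * groups_phase2          # patients enrolled in phase II
m_cap = (w - phase2_total) // 4           # cap per phase III group (4 groups)

# M_T = phase II total + 4 * M_n, with the phase II total fixed.
mean_MT = phase2_total + 4 * mean_Mn
sd_MT = 4 * sd_Mn

osp = 0.962 * 0.764       # launch probability x SP of phase III

print(phase2_total, m_cap)
print(round(mean_MT, 1), round(sd_MT, 1))
print(round(mean_MT / M_I, 1), round(sd_MT / M_I, 1))
print(round(osp, 3))
```

On average roughly 17 of the 25 stored units of $M_I$ are spent, which is the practical meaning of the remark that not all the stored resources are used.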