Sample sizes for non-inferiority studies based on the difference between two proportions: a unified approach for difference, ratio and odds ratio models.



INTRODUCTION
Non-inferiority controlled clinical trials with the proportion of successes or failures as the primary outcome are increasingly being carried out, particularly in cardiology, oncology and antibiotic therapy, and their ethical justification [1,2] can nowadays be taken for granted.
The reversed roles of the null (H0) and alternative (HA) hypotheses make the rationale of these trials somewhat counter-intuitive and their statistical testing far from straightforward, as first shown by Dunnett and Gent [3] using a formal statistical significance test, and by Gart [4] using the exact confidence intervals of the odds ratio.
Furthermore, the choice of the largest difference that is not clinically/biologically relevant, and that therefore still allows an Experimental drug to be considered "non-inferior" (the Non-Inferiority Margin, NIM), is central to the scientific and ethical plausibility of a non-inferiority trial and to the validity of its conclusions.
We cannot consider in detail the suggestions about the choice of the NIM given by regulatory guidelines [5,6,7,8,9]; the discrepancies between the FDA and European Medicines Agency (EMA) guidelines for trials in diabetes mellitus [10,11] and in infectious diseases [12,13,14], together with Röhmel's methodological attempt to reconcile them [15]; or the several proposals based on a percentage of the expected difference between the Standard and placebo, suggested, among others, by Holmgren [16], D'Agostino et al. [17], Snapinn [18] and Pigeot et al. [19]. Further recent insights can be found in Hung et al. [20,21], Wiens [22] and Tsong et al. [23].
At the very least, it has to be stated that the choice of the NIM must be based on clinical and statistical criteria, that the NIM must be lower than the smallest difference between the Standard and placebo, and that the NIM must be justified by a real advantage of the Experimental (which is expected to be non-inferior) over the Standard, such as easier administration, fewer adverse events owing to a biologically well-documented mechanism, or lower cost [24].
In addition, in two-arm trials (without a placebo arm for ethical reasons) the choice of the NIM has to fulfil "assay sensitivity" (the Experimental is efficacious in the sense that it would be superior to placebo or to the previous standard) and the "constancy assumption" (the effect of the Standard remains the same) [17].
A second aspect is the parameterisation of the NIM. The odds ratio scale has been supported by Julious [25], Garrett [26], Senn [27], Tu [28], Siegel [29], Kaul and Diamond [30], Wang et al. [31] and Chow et al. [32] (who showed sample size calculations for equality, non-inferiority/superiority and equivalence trials with parallel and cross-over designs), and finally by the FDA guideline [8], although only for low event rates; the CPMP guideline [6] supports it only when the reference treatment is expected to have response rates near 0% or 100%.
However, the impact of the odds ratio parameterisation on the sample size has to be considered and, in the words of PASS® [33], "As a rule of thumb, the difference is best suited for those cases in which 0.20 < P < 0.80", also taking into account that the difference scale is more familiar to clinicians.
In this paper, we consider four parameterisations: the difference (D, the most familiar to clinicians), the ratio (R) and its natural logarithm (LR), and the odds ratio (OR) or, better, its natural logarithm (LOR).
A third relevant point is that there is still no agreed approach to sample size calculation for the difference between two proportions.
Apart from the papers of Makuch and Simon [34] and Blackwelder [35], dealing with sample size calculation for the comparison of two proportions of success in non-inferiority studies, and those of Blackwelder and Chang [36] and Heiselbetz and Edler [37], providing sample size graphs and a computer program respectively, we wish to draw attention to the paper by Farrington and Manning [38] because of its relevance.
Indeed, Farrington and Manning [38] considered three methods of obtaining the approximate variance of the difference between two proportions under a null hypothesis of a non-zero difference: (i) the "observed values" of Dunnett and Gent [3], Makuch and Simon [34] and Blackwelder [35] (method 1); (ii) the values obtained from the "fixed marginal totals" of Dunnett and Gent [3] and Rodary et al. [39] (method 2); and (iii) the "maximum likelihood estimates", i.e. the solutions of a cubic equation according to Miettinen and Nurminen [40], disregarding the term (N1-1)/N1, which is negligible in large samples (method 3). It has to be pointed out that method 3 has also been proposed as a means of overcoming some serious drawbacks of the first two methods: the poor coverage of method 1, and the fact that the constrained estimates under H0 of method 2 have to satisfy some easily violated inequalities [38]. Farrington and Manning [38] also showed the sample size calculations for a non-unity relative risk using the three methods, and allowed for a sample size imbalance equal to the ratio between the sample sizes of the Experimental and the Standard.
However, Farrington and Manning's paper [38] does not clearly distinguish between the probabilities of success and failure (the non-inferiority margin of the difference being negative or positive, and that of the ratio less or greater than 1, for success and failure respectively); some of the scenarios considered are better suited to the H0 formulation of a "superiority test" according to Chow et al. [32]; and the sample sizes shown in Tables I and II of [38] cannot be reproduced using the PASS® software [41,42,43], since it is not pertinent to calculate the sample size when the probability of the Experimental is greater [...]

[...] favourable outcome parameters: the subscripts "S", "St" and "Ex" respectively stand for Success, Standard treatment and Experimental new treatment, which is expected to be "not inferior" to the Standard.
In order to adopt a unified approach to the statistical significance test and the sample size calculation, let $T_S$ be the statistic of interest, i.e. the difference between two success proportions (D), their ratio (R) and its natural logarithm (LR), or their odds ratio (OR) and its natural logarithm (LOR), and assume that, under suitable conditions, the distribution of $T_S$ can be approximated by a Gaussian distribution with mean $\mu_{T_S}$ and variance $\sigma^2_{T_S}$. As the expected value and the variance of $T_S$ differ under H0 and HA, the subscripts "H0" and "HA" will be used: i.e. $\mu_{T_S\_H_0}$ and $\sigma_{T_S\_H_0}$ under H0, and $\mu_{T_S\_H_A}$ and $\sigma_{T_S\_H_A}$ under HA.

2.A.1 Formulation of the H 0 and H A hypotheses
With $\theta_S$ the parameter of interest and $\theta_{0\_S}$ the maximal clinically/biologically irrelevant threshold (the non-inferiority margin), the non-inferiority hypotheses are:
$$H_0: \theta_S \le \theta_{0\_S} \quad \text{(inferiority)}$$
$$H_A: \theta_S > \theta_{0\_S} \quad \text{(non-inferiority)}$$

2.A.2 Statistical significance test
Given the above formulation of H0, the non-inferiority statistical test is always one-sided (on the right), with the approximate test function:
$$z = \frac{t_S - \mu_{T_S\_H_0}}{\sigma_{T_S\_H_0}} \qquad (2.A.2)$$
where $t_S$ is the sample value of $T_S$. With a significance level α = 0.05 two-sided (or, equivalently, 0.025 one-sided), H0 is rejected if $z > z_{1-\alpha/2}$ or, equivalently, if $t_S > t_c$, where $t_c$ is the quantile delimiting the critical region:
$$t_c = \mu_{T_S\_H_0} + z_{1-\alpha/2}\,\sigma_{T_S\_H_0}$$
In practice, however, the non-inferiority H0 is usually rejected if the lower limit of the 95% confidence interval of the difference Experimental minus Standard is greater than the negative non-inferiority margin (equivalently, if the upper limit of the interval of the difference Standard minus Experimental is less than the positive margin). The rejection of the null hypothesis makes the non-inferiority of the Experimental the most plausible conclusion.
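The decision rule based on (2.A.2) and on the critical value $t_c$ can be sketched in a few lines of Python. This is a minimal illustration, not the authors' software: the function name is hypothetical, the numbers in the usage line are invented, and $\mu_{T_S\_H_0}$ and $\sigma_{T_S\_H_0}$ are assumed to have already been obtained from one of the models described in this paper.

```python
from statistics import NormalDist


def noninferiority_z_test(t_s, mu_h0, sigma_h0, alpha_two_sided=0.05):
    """Right-tailed non-inferiority test on a Gaussian-approximated statistic T_S.

    t_s      -- observed (sample) value of T_S
    mu_h0    -- expected value of T_S under H0, i.e. at the non-inferiority margin
    sigma_h0 -- standard error of T_S under H0
    Returns the test function z, the critical value t_c and the decision.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha_two_sided / 2)  # z_{1-alpha/2}
    z = (t_s - mu_h0) / sigma_h0                             # test function (2.A.2)
    t_c = mu_h0 + z_crit * sigma_h0                          # boundary of the critical region
    # H0 contains the equality, so rejection requires a strictly larger value
    return z, t_c, z > z_crit


# Hypothetical numbers: observed difference -0.01, margin -0.075, SE 0.0327
z, t_c, reject = noninferiority_z_test(-0.01, -0.075, 0.0327)
print(f"z = {z:.3f}, t_c = {t_c:.4f}, reject H0: {reject}")
```

Comparing $z$ with $z_{1-\alpha/2}$ and comparing $t_S$ with $t_c$ are the same decision expressed on two scales.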
It should be mentioned that, since H0 includes equality with the non-inferiority margin, statistical significance formally requires the test statistic to be strictly greater than the corresponding quantile of the Z distribution; for continuous variables, however, this distinction is of no practical relevance.

2.A.3 Sample size calculation
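The rationale underlying the sample size calculation requires the simultaneous occurrence of two events: a statistically significant result under H0, and the rejection of H0 under HA with probability equal to the power of the test:
$$P(T_S > t_c \mid H_0) = \alpha/2, \qquad P(T_S > t_c \mid H_A) = 1-\beta$$
Solving these basic equations for $t_c$ gives:
$$t_c = \mu_{T_S\_H_0} + z_{1-\alpha/2}\,\sigma_{T_S\_H_0}$$
and:
$$t_c = \mu_{T_S\_H_A} + z_{\beta}\,\sigma_{T_S\_H_A} = \mu_{T_S\_H_A} - z_{1-\beta}\,\sigma_{T_S\_H_A}$$
Note that $z_\beta$ has been replaced by $-z_{1-\beta}$ because of the symmetry of the Z distribution. Finally, by equating the above expressions for $t_c$, the sample size can be calculated using the following general pivotal formula:
$$\mu_{T_S\_H_0} + z_{1-\alpha/2}\,\sigma_{T_S\_H_0} = \mu_{T_S\_H_A} - z_{1-\beta}\,\sigma_{T_S\_H_A} \qquad (2.A.3)$$
which has to be explicitly solved for the sample sizes of the Experimental ($n_{S\_Ex}$) and the Standard ($n_{S\_St}$). After putting $n_{S\_St} = k \cdot n_{S\_Ex}$, the standard errors can be written in the simplified form $\sigma_{T_S\_H_0} = \sqrt{V_{H_0}/n_{S\_Ex}}$ and $\sigma_{T_S\_H_A} = \sqrt{V_{H_A}/n_{S\_Ex}}$, where $V_{H_0}$ and $V_{H_A}$ (which depend on k) denote the variances per Experimental subject ($n_{S\_Ex}$ is made explicit because it is the unknown of the sample size formula). In this way, the general equation becomes:
$$\mu_{T_S\_H_A} - \mu_{T_S\_H_0} = \frac{z_{1-\alpha/2}\sqrt{V_{H_0}} + z_{1-\beta}\sqrt{V_{H_A}}}{\sqrt{n_{S\_Ex}}}$$
which, solved for $n_{S\_Ex}$, gives the general formula for the sample size of the Experimental group:
$$n_{S\_Ex} = \left(\frac{z_{1-\alpha/2}\sqrt{V_{H_0}} + z_{1-\beta}\sqrt{V_{H_A}}}{\mu_{T_S\_H_A} - \mu_{T_S\_H_0}}\right)^2 \qquad (2.A.3.1)$$
This formula has to be appropriately adapted to the parameters of each model and, in order to allow for unequal allocation, the ratio of the two sample sizes ($k = n_{S\_St}/n_{S\_Ex}$) has to be specified, with $n_{S\_St}$ calculated as $k \cdot n_{S\_Ex}$ rather than by an ad hoc equation similar to 2.A.3.1.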
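Solving the pivotal formula (2.A.3) for the Experimental sample size can be expressed as a small helper. In this sketch (an illustration under stated assumptions, not the authors' program) the hypothetical arguments `v_h0` and `v_ha` are the variances of $T_S$ under H0 and HA multiplied by $n_{S\_Ex}$, with the allocation ratio k already absorbed, so that each standard error is $\sqrt{v/n_{S\_Ex}}$.

```python
import math
from statistics import NormalDist


def n_experimental(delta_mu, v_h0, v_ha, alpha_two_sided=0.05, power=0.90):
    """Experimental-arm sample size obtained by solving (2.A.3) for n_Ex.

    delta_mu -- mu_{T_S_HA} - mu_{T_S_H0}, which must be positive
    v_h0     -- n_Ex * sigma^2_{T_S_H0}  (per-subject variance under H0)
    v_ha     -- n_Ex * sigma^2_{T_S_HA}  (per-subject variance under HA)
    """
    if delta_mu <= 0:
        raise ValueError("the mean under HA must exceed the mean under H0")
    z_a = NormalDist().inv_cdf(1 - alpha_two_sided / 2)  # z_{1-alpha/2}
    z_b = NormalDist().inv_cdf(power)                     # z_{1-beta}
    return ((z_a * math.sqrt(v_h0) + z_b * math.sqrt(v_ha)) / delta_mu) ** 2


def n_standard(n_ex, k=1.0):
    """Standard-arm size from the allocation ratio k = n_St / n_Ex."""
    return k * n_ex


# Illustrative inputs: two arms with success probability 0.65 (per-subject
# variance 0.65*0.35*2 = 0.455) and a margin of 0.075 on the difference scale
n_ex = math.ceil(n_experimental(0.075, 0.455, 0.455))
print(n_ex, math.ceil(n_standard(n_ex, k=1.0)))
```

When `v_h0` equals `v_ha`, the expression reduces to $(z_{1-\alpha/2}+z_{1-\beta})^2\,v/(\mu_{T_S\_H_A}-\mu_{T_S\_H_0})^2$, which is the form in which the worked examples quote the factor 10.507971.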

2.A.4 Power of the statistical significance test
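Given that $t_c = \mu_{T_S\_H_0} + z_{1-\alpha/2}\,\sigma_{T_S\_H_0}$, the power (1-β) of the statistical test under HA is given by:
$$1-\beta = P(T_S > t_c \mid H_A) = \Phi\left(\frac{\mu_{T_S\_H_A} - \mu_{T_S\_H_0} - z_{1-\alpha/2}\,\sigma_{T_S\_H_0}}{\sigma_{T_S\_H_A}}\right) \qquad (2.A.4.1)$$
where Φ is the standard normal distribution function. In addition, the power of the test can be obtained more intuitively by solving the sample size formula (2.A.3) for $z_{1-\beta}$ and then calculating its corresponding probability value (1-β):
$$z_{1-\beta} = \frac{\mu_{T_S\_H_A} - \mu_{T_S\_H_0} - z_{1-\alpha/2}\,\sigma_{T_S\_H_0}}{\sigma_{T_S\_H_A}}, \qquad 1-\beta = \Phi(z_{1-\beta}) \qquad (2.A.4.2)$$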
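A minimal sketch of the power calculation in (2.A.4.1)/(2.A.4.2) follows. It assumes that each standard error is written as $\sqrt{v/n_{S\_Ex}}$ with a per-subject variance `v` (a hypothetical parametrisation chosen for illustration); the pivotal equation is solved for $z_{1-\beta}$ and the normal CDF gives the power.

```python
import math
from statistics import NormalDist


def power_noninferiority(delta_mu, v_h0, v_ha, n_ex, alpha_two_sided=0.05):
    """Power (1 - beta) of the right-tailed non-inferiority test.

    delta_mu   -- mu_{T_S_HA} - mu_{T_S_H0}
    v_h0, v_ha -- per-subject variances of T_S under H0 and HA (sigma^2 = v / n_ex)
    n_ex       -- Experimental sample size
    """
    z_a = NormalDist().inv_cdf(1 - alpha_two_sided / 2)
    sigma_h0 = math.sqrt(v_h0 / n_ex)
    sigma_ha = math.sqrt(v_ha / n_ex)
    z_1_minus_beta = (delta_mu - z_a * sigma_h0) / sigma_ha  # (2.A.3) solved for z_{1-beta}
    return NormalDist().cdf(z_1_minus_beta)


# Illustrative inputs: per-subject variance 0.455, effect 0.075, 850 per arm
print(round(power_noninferiority(0.075, 0.455, 0.455, n_ex=850), 3))
```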

2.B. Failure probability
The failure probabilities of the Standard and the Experimental are denoted by $\pi_{F\_St}$ and $\pi_{F\_Ex}$ respectively, and the sample proportions ($p_F$) are independent binomial proportions. As the theoretical derivation is very similar to that for success probabilities, it is given in paragraph 2.1 of Appendix 2 of the supplementary material.

3.A. Success probability (the corresponding treatment of the failure probability is shown in paragraph 2.2 of Appendix 2 of the supplementary material)
For each Model, we give the null (H 0 ) and alternative (H A ) hypotheses, the sample distribution, and the formulae for testing H 0 and calculating the sample size and power.

3.A.1.1 Null (H 0 ) and alternative (H A ) hypotheses
The general parameter $\theta_S$ becomes the difference between the two true probabilities, $D_{T\_S} = \pi_{S\_Ex} - \pi_{S\_St}$, and $\theta_{0\_S} = -\delta_{0\_S}$, where $\delta_{0\_S}$ is the non-inferiority margin (the Greek lower-case letter is adopted because it is widely used in the statistical literature).
The non-inferiority hypotheses are:
$$H_0: D_{T\_S} \le -\delta_{0\_S} \qquad H_A: D_{T\_S} > -\delta_{0\_S}$$
where $0 < \delta_{0\_S} < 1$. As $\delta_{0\_S}$ is positive, the margin enters the hypotheses with a negative sign ($-\delta_{0\_S}$) because it is expected that $\pi_{S\_Ex} < \pi_{S\_St}$. In addition to these theoretical limits, the upper limit of $\delta_{0\_S}$ depends on $\pi_{S\_St}$ and is the largest value still compatible with "no clinical/biological difference" in the non-inferiority model; conversely, if $\delta_{0\_S}$ is too close to zero, the sample size becomes so large that the study is unfeasible.
The null hypothesis (H0) is that the difference is less than or equal to the negative non-inferiority margin ($-\delta_{0\_S}$); the alternative hypothesis (HA) is that the difference is greater than $-\delta_{0\_S}$, so that any shortfall of the Experimental can pragmatically be considered clinically or biologically irrelevant.

3.A.1.2 Sampling distribution
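Under H0, the sampling distribution of the observed difference has to be considered shifted by the non-inferiority margin:
$$D_S = p_{S\_Ex} - p_{S\_St} + \delta_{0\_S}$$
Under suitable conditions (generally, large sample sizes) it can be approximated by a Gaussian distribution with mean value and standard error:
$$E(D_S) = \pi_{S\_Ex} - \pi_{S\_St} + \delta_{0\_S}, \qquad \sigma_{D_S} = \sqrt{\frac{\pi_{S\_Ex}(1-\pi_{S\_Ex})}{n_{DS\_Ex}} + \frac{\pi_{S\_St}(1-\pi_{S\_St})}{n_{DS\_St}}}$$
Its expected values and variances differ under H0 and HA, as emphasised by the subscripts _H0 and _HA. Under H0 (at the boundary $\pi_{S\_Ex} - \pi_{S\_St} = -\delta_{0\_S}$) we have $E(D_{S\_H_0}) = 0$, with $\sigma_{D_S\_H_0}$ given by the above expression evaluated at the probabilities under H0, $\tilde\pi_{S\_Ex}$ and $\tilde\pi_{S\_St}$; under the non-inferiority HA we have $E(D_{S\_H_A}) = \pi_{S\_Ex} - \pi_{S\_St} + \delta_{0\_S} > 0$, with $\sigma_{D_S\_H_A}$ evaluated at $\pi_{S\_Ex}$ and $\pi_{S\_St}$. As also pointed out by Farrington and Manning [38], the unknown probabilities $\pi_{S\_Ex}$ and $\pi_{S\_St}$ in the variance under H0 have to be replaced by the estimates obtained using the observed values (method 1), the values obtained by fixing the marginal totals (method 2), or the maximum likelihood estimates (method 3).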
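For method 3, the constrained maximum likelihood estimates under H0 are the admissible root of a cubic equation. The sketch below uses the closed-form trigonometric solution commonly quoted for Farrington and Manning's difference model [38]; it is written from that general formula rather than taken from this paper, the function name is hypothetical, and it should be checked against [38] before use. Here $k = n_{S\_St}/n_{S\_Ex}$ and, for success probabilities, H0 fixes $\tilde\pi_{S\_Ex} - \tilde\pi_{S\_St} = -\delta_{0\_S}$.

```python
import math


def fm_constrained_mle(p_ex, p_st, diff0, k=1.0):
    """Constrained MLEs (pi_Ex, pi_St) under H0: pi_Ex - pi_St = diff0.

    p_ex, p_st -- observed (or anticipated) success proportions
    diff0      -- difference fixed by H0 (here -delta_0S, a negative number)
    k          -- allocation ratio n_St / n_Ex
    """
    # Coefficients of the cubic a*x^3 + b*x^2 + c*x + d = 0 in x = pi_Ex
    a = 1.0 + k
    b = -(1.0 + k + p_ex + k * p_st + diff0 * (k + 2.0))
    c = diff0 ** 2 + diff0 * (2.0 * p_ex + k + 1.0) + p_ex + k * p_st
    d = -p_ex * diff0 * (1.0 + diff0)
    # Trigonometric solution selecting the root that gives valid probabilities
    v = b ** 3 / (27.0 * a ** 3) - b * c / (6.0 * a ** 2) + d / (2.0 * a)
    u = math.copysign(math.sqrt(b ** 2 / (9.0 * a ** 2) - c / (3.0 * a)), v)
    cos_arg = max(-1.0, min(1.0, v / u ** 3))  # guard against rounding
    w = (math.pi + math.acos(cos_arg)) / 3.0
    pi_ex = 2.0 * u * math.cos(w) - b / (3.0 * a)
    return pi_ex, pi_ex - diff0


pi_ex, pi_st = fm_constrained_mle(0.65, 0.65, -0.075)
v_h0 = pi_ex * (1 - pi_ex) + pi_st * (1 - pi_st) / 1.0  # per-subject variance under H0, k = 1
print(round(pi_ex, 4), round(pi_st, 4), round(v_h0, 4))
```

A quick check of correctness is that the returned pair satisfies both the H0 constraint and the score equation of the binomial log-likelihood.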

3.A.1.3 Significance test
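The inferiority H0 is rejected at the α/2 significance level if:
$$z = \frac{p_{S\_Ex} - p_{S\_St} + \delta_{0\_S}}{\sigma_{D_S\_H_0}} > z_{1-\alpha/2}$$
or, equivalently:
$$p_{S\_Ex} - p_{S\_St} > -\delta_{0\_S} + z_{1-\alpha/2}\,\sigma_{D_S\_H_0}$$
However, a general and simpler approach, valid for all of the models, is to consider H0 rejected when the lower limit of the 95% confidence interval of the difference is greater than $-\delta_{0\_S}$ (the non-inferiority margin).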

3.A.1.4. Sample size calculation
Let us define the sample sizes in the two treatment groups as $n_{DS\_Ex}$ and $n_{DS\_St} = k \cdot n_{DS\_Ex}$, whose values follow from the general pivotal formula (2.A.3). Once again, the variance under H0 has to be estimated in accordance with one of the three approaches described by Farrington and Manning [38].

Example 3.A.1.4 (estimates calculated using Farrington and Manning's method 3 [38]). Assuming $\pi_{S\_St} = 0.65$ and $\pi_{S\_Ex} = \pi_{S\_St} = 0.65$, a non-inferiority margin $\delta_{0\_S} = 0.075$, a two-sided significance level α = 0.05 ($z_{1-\alpha/2} = 1.96$) and a power 1-β = 0.90 ($z_{1-\beta} = 1.2816$), so that $(z_{1-\alpha/2} + z_{1-\beta})^2 = 10.507971$, the sample size of the Experimental group, equal to that of the Standard group (k = 1), is easily obtained from formula 3.A.1.4 as $n_{DS\_Ex} = 849.98 \approx 850$.
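The figure of Example 3.A.1.4 can be checked with a few lines of Python. The sketch makes a simplifying assumption of ours, not of the paper: the same per-subject variance, $\pi_{S\_Ex}(1-\pi_{S\_Ex}) + \pi_{S\_St}(1-\pi_{S\_St})/k$, is used under H0 and HA, so that the sample size is $(z_{1-\alpha/2}+z_{1-\beta})^2$ times that variance divided by $(\pi_{S\_Ex}-\pi_{S\_St}+\delta_{0\_S})^2$. With the rounded quantiles 1.96 and 1.2816 it returns the 849.98 reported above; a full method-3 calculation, with constrained estimates in the H0 variance, can give a slightly different value.

```python
import math


def n_difference(pi_st, pi_ex, delta0, k=1.0, z_a=1.96, z_b=1.2816):
    """Experimental-arm sample size for the difference model (D_S), sketch only.

    Assumed simplification: one common per-subject variance under H0 and HA.
    delta0 is the positive non-inferiority margin delta_0S; k = n_St / n_Ex.
    """
    v = pi_ex * (1 - pi_ex) + pi_st * (1 - pi_st) / k
    return (z_a + z_b) ** 2 * v / (pi_ex - pi_st + delta0) ** 2


n = n_difference(0.65, 0.65, 0.075)
print(round(n, 2), math.ceil(n))  # 849.98 850
```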
It is worth repeating that, in non-inferiority settings, a very sensible starting point is to set $\pi_{S\_St} = \pi_{S\_Ex}$, thus assuming that the Experimental is at least as effective as the Standard, because this is in line with the equipoise position that makes randomised controlled trials ethically feasible. If the two drugs are assumed to differ in effectiveness, adding the non-inferiority margin ($\delta_{0\_S}$) could lead to a clinically relevant total difference that is unsuitable for a non-inferiority study.
Epidemiology Biostatistics and Public Health -2020, Volume 17, Number 1 Sample sizes for non-inferiority studies based on the difference between two proportions: a unified approach for difference, ratio and odds ratio models

3.A.1.5. Power calculation
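Under the alternative hypothesis that $D_{T\_S} > -\delta_{0\_S}$, the power of the above test is:
$$1-\beta = \Phi\left(\frac{\pi_{S\_Ex} - \pi_{S\_St} + \delta_{0\_S} - z_{1-\alpha/2}\,\sigma_{D_S\_H_0}}{\sigma_{D_S\_H_A}}\right)$$
Alternatively, the sample size formula can be solved for $z_{1-\beta}$ and the probability value of this quantile then calculated (3.A.1.5.Bis).

3.A.2. Second Model: ratio between two success probabilities (the term "relative risk" should be restricted to the ratio of two failure probabilities).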

3.A.2.1.1 Null (H0_S) and alternative (HA_S) hypotheses
The general parameter $\theta_S$ becomes $R_{T\_S} = \pi_{S\_Ex}/\pi_{S\_St}$, with $R_{0\_S}$ the non-inferiority limit of $R_{T\_S}$, and the null and alternative hypotheses are:
$$H_0: R_{T\_S} \le R_{0\_S} \qquad H_A: R_{T\_S} > R_{0\_S}$$
As it is expected that $\pi_{S\_Ex} < \pi_{S\_St}$, we have $0 < R_{0\_S} < 1$. The null hypothesis (H0) is the hypothesis of inferiority, i.e. a ratio less than or equal to the non-inferiority margin $R_{0\_S}$, and the alternative hypothesis (HA) is that of a ratio greater than the margin, i.e. a clinically or biologically irrelevant shortfall.
It should be noted that the above conditions correspond to those of Example 3.A.1.4 for model 1, as $R_{0\_S}$ is obtained from $\delta_{0\_S}$ by means of the conversion formula $R_{0\_S} = (\pi_{S\_St} - \delta_{0\_S})/\pi_{S\_St}$. In this case, the sample size of 758 in each treatment group is much lower than the 850 obtained using model 1.
Finally, it should be pointed out that, if $R_{0\_S}$ is rounded to 0.885, $n_{RS\_Ex} = 762.89 \approx 763$: i.e. an increase of only about four ten-thousandths in the non-inferiority margin expressed as a ratio (from 0.884615 to 0.885) leads to an increase of five subjects in each treatment group.
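The ratio-model figures can be reproduced in the same way if, as an assumption on our part, the test statistic is taken to be $p_{S\_Ex} - R_{0\_S}\,p_{S\_St}$ (the form used by Farrington and Manning for a non-unity relative risk), with one common variance under H0 and HA. Under these assumptions the sketch below (function name hypothetical) returns 758 for $R_{0\_S} = 0.575/0.65$ and 763 for the rounded margin 0.885, matching the values quoted above.

```python
import math


def n_ratio(pi_st, pi_ex, r0, k=1.0, z_a=1.96, z_b=1.2816):
    """Experimental-arm sample size for the ratio model (R_S), sketch only.

    Assumed statistic: p_Ex - r0 * p_St, whose expectation is zero at the margin;
    the same per-subject variance is used under H0 and HA.
    """
    v = pi_ex * (1 - pi_ex) + r0 ** 2 * pi_st * (1 - pi_st) / k
    return (z_a + z_b) ** 2 * v / (pi_ex - r0 * pi_st) ** 2


r0_exact = (0.65 - 0.075) / 0.65  # conversion from delta_0S = 0.075
print(math.ceil(n_ratio(0.65, 0.65, r0_exact)), math.ceil(n_ratio(0.65, 0.65, 0.885)))
```

Because $R_{0\_S}^2 < 1$, the Standard-arm variance is shrunk relative to the difference model while the effect size stays equal to $\delta_{0\_S}$, which is why this model needs fewer subjects here.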

3.A.2.1.5. Power calculation
Under the alternative hypothesis that $R_{T\_S} > R_{0\_S}$, the power of the above test follows from the general formula (2.A.4.1); alternatively, by solving for $z_{1-\beta}$ and then calculating the probability value associated with this quantile, we have (3.A.2.1.5.Bis).

3.A.2.2.1. Second Model Extension: (natural) logarithm of the ratio between two success probabilities.
With $LR_{T\_S} = \ln(R_{T\_S})$ and $LR_{0\_S} = \ln(R_{0\_S})$, the previous null and alternative hypotheses concerning the ratio between two success probabilities are rewritten on the log scale as:
$$H_0: LR_{T\_S} \le LR_{0\_S} \qquad H_A: LR_{T\_S} > LR_{0\_S}$$
As $0 < R_{0\_S} < 1$, we have $LR_{0\_S} < 0$.

3.A.2.2.2 Sampling distribution
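It is necessary to consider the shifted sampling distribution of $LR_S = \ln(p_{S\_Ex}) - \ln(p_{S\_St}) - LR_{0\_S}$. Its expected value and standard deviation, the latter calculated using the delta method, are:
$$E(LR_S) = LR_{T\_S} - LR_{0\_S}, \qquad \sigma_{LR_S} = \sqrt{\frac{1-\pi_{S\_Ex}}{n_{LRS\_Ex}\,\pi_{S\_Ex}} + \frac{1-\pi_{S\_St}}{n_{LRS\_St}\,\pi_{S\_St}}}$$
For large sample sizes, this distribution can be standardised and approximated by a standard Gaussian distribution: $Z = [LR_S - E(LR_S)]/\sigma_{LR_S} \sim N(0,1)$.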

3.A.2.2.3 Significance test
The null hypothesis is rejected at the α/2 level of significance if:
$$\frac{\ln(p_{S\_Ex}) - \ln(p_{S\_St}) - LR_{0\_S}}{\sigma_{LR_S\_H_0}} > z_{1-\alpha/2}$$

3.A.2.2.4 Sample size calculation
Let the sample sizes of the two treatment groups be $n_{LRS\_Ex}$ and $n_{LRS\_St} = k \cdot n_{LRS\_Ex}$. Assuming $\pi_{S\_St} = 0.65$ and $\pi_{S\_Ex} = \pi_{S\_St} = 0.65$ (i.e. $R_{T\_S} = 1$), a non-inferiority margin $LR_{0\_S} = -0.1226$, a two-sided significance level α = 0.05 ($z_{1-\alpha/2} = 1.96$) and a power of 0.90 ($z_{1-\beta} = 1.2816$), so that $(z_{1-\alpha/2} + z_{1-\beta})^2 = 10.507971$, the sample size of the Experimental group, equal to that of the Standard group (k = 1), is easily obtained from formula 3.A.2.1.2 as $n_{LRS\_Ex} = 752.80 \approx 753$.
It has to be noted that the above conditions correspond to those of Example 3.A.1.4 of model 1, as $R_{0\_S}$ is obtained from $\delta_{0\_S}$ by means of the conversion formula $R_{0\_S} = (\pi_{S\_St} - \delta_{0\_S})/\pi_{S\_St}$ and $LR_{0\_S} = \ln(R_{0\_S}) = \ln(0.884615) = -0.1226$. In this case, the sample size of 753 in each treatment group is slightly lower than the 758 required by model 2, and much lower than the 850 required by model 1.
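For the log-ratio model, the delta-method per-subject variance of $\ln(p)$, $(1-\pi)/\pi$, leads to the sketch below. The function name is hypothetical and, as an assumption of ours, the same variance is used under H0 and HA; with the conditions of the example it returns 753.

```python
import math


def n_log_ratio(pi_st, pi_ex, lr0, k=1.0, z_a=1.96, z_b=1.2816):
    """Experimental-arm sample size for the log-ratio model (LR_S), sketch only."""
    # Delta-method per-subject variance of ln(p_Ex) - ln(p_St)
    v = (1 - pi_ex) / pi_ex + (1 - pi_st) / (k * pi_st)
    return (z_a + z_b) ** 2 * v / (math.log(pi_ex / pi_st) - lr0) ** 2


lr0 = math.log((0.65 - 0.075) / 0.65)  # margin converted from delta_0S = 0.075
print(round(lr0, 4), math.ceil(n_log_ratio(0.65, 0.65, lr0)))
```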

3.A.2.2.5 Power Calculation
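Under the alternative hypothesis that $LR_{T\_S} > LR_{0\_S}$, the power of the above test is:
$$1-\beta = \Phi\left(\frac{LR_{T\_S} - LR_{0\_S} - z_{1-\alpha/2}\,\sigma_{LR_S\_H_0}}{\sigma_{LR_S\_H_A}}\right)$$
Alternatively, formula 3.A.2.1.4 can be straightforwardly solved for $z_{1-\beta}$, after which the corresponding probability value can be calculated.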

3.A.3. Third Model: Odds ratio (OR S ) of two success probabilities and its (natural) logarithm (LOR s ).
The comparison of two success proportions can also be expressed using the odds ratio ($OR_S$), with the true odds ratio:
$$OR_{T\_S} = \frac{\pi_{S\_Ex}/(1-\pi_{S\_Ex})}{\pi_{S\_St}/(1-\pi_{S\_St})}$$
and $OR_{0\_S}$ as its non-inferiority margin.

3.A.3.1 Null (H 0 ) and alternative (H A ) hypotheses
The non-inferiority hypotheses in terms of the odds ratio are:
$$H_0: OR_{T\_S} \le OR_{0\_S} \qquad H_A: OR_{T\_S} > OR_{0\_S}$$
with $0 < OR_{0\_S} < 1$, as it is expected that $\pi_{S\_Ex} < \pi_{S\_St}$.
Of course, the actual limits are the values compatible with the non-inferiority settings and, consequently, the "clinically or biologically irrelevant difference".
The sample odds ratio is given by:
$$OR_S = \frac{p_{S\_Ex}/(1-p_{S\_Ex})}{p_{S\_St}/(1-p_{S\_St})}$$
However, it is better to consider its natural logarithm, $LOR_S = \ln(OR_S)$, because of its more suitable distributional properties; consequently, with $LOR_{T\_S} = \ln(OR_{T\_S})$ and $LOR_{0\_S} = \ln(OR_{0\_S})$, the hypotheses become:
$$H_0: LOR_{T\_S} \le LOR_{0\_S} \qquad H_A: LOR_{T\_S} > LOR_{0\_S}$$
As $0 < OR_{0\_S} < 1$, $LOR_{0\_S} < 0$.

3.A.3.2 Sampling distribution
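It is necessary to consider the shifted sampling distribution of $LOR_S - LOR_{0\_S}$. Its expected value and standard deviation, the latter calculated using the delta method, are:
$$E(LOR_S - LOR_{0\_S}) = LOR_{T\_S} - LOR_{0\_S}, \qquad \sigma_{LOR_S} = \sqrt{\frac{1}{n_{LORS\_Ex}\,\pi_{S\_Ex}(1-\pi_{S\_Ex})} + \frac{1}{n_{LORS\_St}\,\pi_{S\_St}(1-\pi_{S\_St})}}$$
By expressing $\pi_{S\_Ex}$ in terms of the known parameters of the model ($OR_{T\_S}$ and $\pi_{S\_St}$), we obtain:
$$\pi_{S\_Ex} = \frac{OR_{T\_S}\,\pi_{S\_St}}{1 - \pi_{S\_St} + OR_{T\_S}\,\pi_{S\_St}}$$
Assuming that, for large sample sizes, the distribution of $LOR_S$ can be approximated by a Gaussian curve, the standardised variable is $Z = (LOR_S - LOR_{T\_S})/\sigma_{LOR_S} \sim N(0,1)$.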

3.A.3.3 Statistical test
The null hypothesis is rejected at the α/2 level of significance if:
$$\frac{LOR_S - LOR_{0\_S}}{\sigma_{LOR_S\_H_0}} > z_{1-\alpha/2}$$

3.A.3.4 Sample size calculation
Let the sample sizes in the two treatment groups be $n_{LORS\_Ex}$ and $n_{LORS\_St} = k \cdot n_{LORS\_Ex}$, so that the standard deviation of $LOR_S$ takes its H0 and HA forms; the sample size formula then follows from the general formula (2.A.3), and [...] it is easy to obtain a simpler formula. [...] With $(z_{1-\alpha/2} + z_{1-\beta})^2 = 10.507971$, the sample size of the Experimental group, equal to that of the Standard group (k = 1), is easily obtained from formula (3.A.3.3) as $n_{LORS\_Ex} = 920.64 \approx 921$. This is larger than the 850 calculated using the difference (model 1), and much larger than the 758 calculated using the ratio ($R_S$, model 2.1) or the 753 obtained from $LR_S$ (model 2.2).
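A corresponding sketch for the log odds ratio uses the delta-method per-subject variance $1/[\pi(1-\pi)]$ for each log-odds. Here the margin is obtained from $\delta_{0\_S} = 0.075$ and $\pi_{S\_St} = 0.65$ by converting the difference margin into an odds-ratio margin; that choice, like the use of one common variance under H0 and HA, is our assumption, since the conditions of the example are only partly legible above. With it, the sketch gives 921.

```python
import math


def logit(p):
    """Natural log of the odds p / (1 - p)."""
    return math.log(p / (1 - p))


def n_log_odds_ratio(pi_st, pi_ex, lor0, k=1.0, z_a=1.96, z_b=1.2816):
    """Experimental-arm sample size for the log odds ratio model (LOR_S), sketch only."""
    v = 1 / (pi_ex * (1 - pi_ex)) + 1 / (k * pi_st * (1 - pi_st))
    lor_true = logit(pi_ex) - logit(pi_st)
    return (z_a + z_b) ** 2 * v / (lor_true - lor0) ** 2


lor0 = logit(0.65 - 0.075) - logit(0.65)  # margin implied by delta_0S = 0.075
print(round(lor0, 4), math.ceil(n_log_odds_ratio(0.65, 0.65, lor0)))
```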

3.A.3.5 Power calculation
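Under the alternative hypothesis that $OR_{T\_S} > OR_{0\_S}$, the power of the above test is:
$$1-\beta = \Phi\left(\frac{LOR_{T\_S} - LOR_{0\_S} - z_{1-\alpha/2}\,\sigma_{LOR_S\_H_0}}{\sigma_{LOR_S\_H_A}}\right)$$
Otherwise (and equivalently), formula 3.A.3.4 can be straightforwardly solved for $z_{1-\beta}$, after which the probability value associated with this quantile can be calculated.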

3.A.4. Success probability tables
Table 1 shows the H 0_S hypothesis, sample distributions and sample size calculation formulae of the three models, with the second model being divided into the ratio (R S , model 2.1) and logarithm of the ratio models (LR S , model 2.2), and OR S and LOR S being considered together as the third model.
Table 1.1 shows the sample sizes calculated using the three methods of Farrington and Manning [38] for $R_{T\_S}$ values ranging from 0.85 to 1.0 in steps of 0.05 and $\pi_{S\_St}$ values ranging from 0.1 to 0.9 in steps of 0.2, assuming α = 0.025, 1-β = 0.80 and non-inferiority margins (expressed as $R_{0\_S}$ for consistency with Laster et al. [49]) of 0.8, 0.85 and 0.95.

SWITCHING NON-INFERIORITY MARGINS FROM ONE MODEL TO ANOTHER: COMPARISON OF SAMPLE SIZES
We propose a general method for switching from one model to another, valid for all four models and for both successes and failures. The hypotheses of one model are thus re-parameterised in terms of another model, yielding the corresponding non-inferiority margins. To maintain consistency of definitions and approaches, we impose the general constraint that the NI margin of the final model is calculated only from the NI margin of the starting model and from $\pi_{S\_St}$, which is independent of the NI margin and considered "known" in the planning phase of a study.

4.A. Success probability
Switching the calculation of non-inferiority margins from one model to another is based on a fixed $\pi_{S\_St}$, which is considered "known" during the planning phase of a trial, in the same way as the true difference ($D_{T\_S}$), the true ratio ($R_{T\_S}$) and the true odds ratio ($OR_{T\_S}$).
Table 3 shows the switching formulae of the three models.

4.A.1.1 Model 1 (difference/delta: D S ) vs model 2.1 (ratio: R S ) and model 2.2 (ln ratio: LR S )
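Starting from the non-inferiority H0_S hypothesis concerning the difference between two success probabilities ($\pi_{S\_Ex} - \pi_{S\_St} \le -\delta_{0\_S}$), the last term of the following chain of inequalities is obtained by dividing both sides of the preceding inequality by $\pi_{S\_St}$:
$$\pi_{S\_Ex} - \pi_{S\_St} \le -\delta_{0\_S} \iff \pi_{S\_Ex} \le \pi_{S\_St} - \delta_{0\_S} \iff \frac{\pi_{S\_Ex}}{\pi_{S\_St}} \le \frac{\pi_{S\_St} - \delta_{0\_S}}{\pi_{S\_St}}$$
Thus, by putting $R_{0\_S} = (\pi_{S\_St} - \delta_{0\_S})/\pi_{S\_St} = 1 - \delta_{0\_S}/\pi_{S\_St}$, it is possible to calculate the non-inferiority margin for the ratio of two success probabilities.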
Then, using $R_{T\_S} = \pi_{S\_Ex}/\pi_{S\_St}$ and $R_{0\_S}$, the statistical hypotheses can be formulated in terms of a ratio; in the case of H0_S, it can be written $H_{0\_S}: R_{T\_S} \le R_{0\_S}$. Conversely, starting from the non-inferiority hypotheses of the ratio of two success probabilities ($R_{0\_S}$), the value of $\pi_{S\_St}$ has to be fixed in order to obtain the non-inferiority margin in terms of a difference: $\delta_{0\_S} = \pi_{S\_St}(1 - R_{0\_S})$.

It should be noted that the constraint $0 < R_{0\_S} < 1$ implies $0 < \delta_{0\_S} < \pi_{S\_St}$ and vice versa, thus leading to a 1:1 correspondence between the two models; however, the upper limits are only theoretical and have no sense in the context of non-inferiority studies. [...] Alternatively, from $\pi_{S\_St} = 0.8$ and $R_{0\_S} = 0.9375$ or $LR_{0\_S} = -0.06454$, it is straightforward to obtain $\delta_{0\_S} = 0.05$ by inverting the above equation as $\delta_{0\_S} = \pi_{S\_St}(1 - R_{0\_S}) = \pi_{S\_St}(1 - e^{LR_{0\_S}})$. In this way, the hypotheses formulated for model 1 ($D_S$), model 2.1 ($R_S$) and model 2.2 ($LR_S$) are equivalent; furthermore, as $LR_{0\_S} = \ln(R_{0\_S})$, it is extremely simple to switch from model 2.1 ($R_S$) to model 2.2 ($LR_S$), and vice versa.
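The switching between the difference, ratio and log-ratio margins amounts to two one-line conversions given $\pi_{S\_St}$. The sketch below (function names hypothetical) reproduces the numerical example with $\pi_{S\_St} = 0.8$: $\delta_{0\_S} = 0.05$ maps to $R_{0\_S} = 0.9375$ and $LR_{0\_S} = -0.06454$, and back.

```python
import math


def r0_from_delta(pi_st, delta0):
    """Ratio margin equivalent to a difference margin: R0 = (pi_St - delta0) / pi_St."""
    return (pi_st - delta0) / pi_st


def delta_from_r0(pi_st, r0):
    """Inverse conversion: delta0 = pi_St * (1 - R0)."""
    return pi_st * (1 - r0)


r0 = r0_from_delta(0.8, 0.05)
lr0 = math.log(r0)  # log-ratio margin, LR0 = ln(R0)
print(round(r0, 4), round(lr0, 5), round(delta_from_r0(0.8, math.exp(lr0)), 4))
```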

4.A.1.2 Model 1 (D S ) vs model 3 (odds ratio: OR S and ln(OR S ))
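Let us consider the following chain of inequalities, which holds because the odds are an increasing function of the probability:
$$\pi_{S\_Ex} - \pi_{S\_St} \le -\delta_{0\_S} \iff \frac{\pi_{S\_Ex}}{1-\pi_{S\_Ex}} \le \frac{\pi_{S\_St} - \delta_{0\_S}}{1-\pi_{S\_St} + \delta_{0\_S}} \iff OR_{T\_S} \le \frac{(\pi_{S\_St} - \delta_{0\_S})(1-\pi_{S\_St})}{(1-\pi_{S\_St} + \delta_{0\_S})\,\pi_{S\_St}}$$
By putting:
$$OR_{0\_S} = \frac{(\pi_{S\_St} - \delta_{0\_S})(1-\pi_{S\_St})}{(1-\pi_{S\_St} + \delta_{0\_S})\,\pi_{S\_St}}$$
and $LOR_{0\_S} = \ln(OR_{0\_S})$, we can formulate the statistical hypotheses in terms of the OR; in the case of H0, we have $H_0: OR_{T\_S} \le OR_{0\_S}$. Conversely, the hypotheses of model 1 can be obtained from those of model 3 by putting:
$$\delta_{0\_S} = \pi_{S\_St} - \frac{OR_{0\_S}\,\pi_{S\_St}}{1 - \pi_{S\_St} + OR_{0\_S}\,\pi_{S\_St}}$$
It should be noted that the constraint $0 < OR_{0\_S} < 1$ implies $0 < \delta_{0\_S} < \pi_{S\_St}$ (i.e. $0 < R_{0\_S} < 1$) and vice versa, so that there is a 1:1 correspondence between the two models in these intervals. The same applies to the logarithmic transformation, with $LOR_{0\_S} < 0$. Once again, the upper limits are only theoretical, and have no sense in the context of non-inferiority studies or clinical trials in general.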
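The difference-to-odds-ratio switch can be sketched in the same way: the Experimental probability at the margin, $\pi_{S\_St} - \delta_{0\_S}$, is turned into odds and divided by the odds of the Standard, and the inverse back-transforms the margin odds. With $\pi_{S\_St} = 0.8$ and $\delta_{0\_S} = 0.05$ this gives $OR_{0\_S} = (0.75/0.25)/(0.8/0.2) = 0.75$. The function names are hypothetical.

```python
import math


def or0_from_delta(pi_st, delta0):
    """Odds-ratio margin equivalent to a difference margin, given pi_St."""
    pi_ex0 = pi_st - delta0  # Experimental probability at the margin
    return (pi_ex0 / (1 - pi_ex0)) / (pi_st / (1 - pi_st))


def delta_from_or0(pi_st, or0):
    """Inverse conversion: back-transform the margin odds to a probability."""
    pi_ex0 = or0 * pi_st / (1 - pi_st + or0 * pi_st)
    return pi_st - pi_ex0


or0 = or0_from_delta(0.8, 0.05)
print(round(or0, 4), round(math.log(or0), 4), round(delta_from_or0(0.8, or0), 4))
```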

4.A.1.3 Models 2.1 (ratio: R S ) and 2.2 (ln ratio: LR S ) vs model 3 (odds ratio: OR S and ln(OR S ))
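Let us consider the following chain of inequalities:
$$\frac{\pi_{S\_Ex}}{\pi_{S\_St}} \le R_{0\_S} \iff \pi_{S\_Ex} \le R_{0\_S}\,\pi_{S\_St} \iff OR_{T\_S} \le \frac{R_{0\_S}\,\pi_{S\_St}/(1 - R_{0\_S}\,\pi_{S\_St})}{\pi_{S\_St}/(1-\pi_{S\_St})} = \frac{R_{0\_S}(1-\pi_{S\_St})}{1 - R_{0\_S}\,\pi_{S\_St}}$$
Consequently, we obtain the following H0 formulation in terms of $OR_S$: $H_0: OR_{T\_S} \le OR_{0\_S}$, with $OR_{0\_S} = R_{0\_S}(1-\pi_{S\_St})/(1 - R_{0\_S}\,\pi_{S\_St})$ and, conversely, $R_{0\_S} = OR_{0\_S}/(1 - \pi_{S\_St} + OR_{0\_S}\,\pi_{S\_St})$; on the log scale, $LOR_{0\_S} = \ln(OR_{0\_S})$ and $LR_{0\_S} = \ln(R_{0\_S})$. Note that the constraint $0 < R_{0\_S} < 1$ implies $0 < OR_{0\_S} < 1$ and vice versa, so that there is a 1:1 correspondence between the formulated hypotheses of the two models, with the same consideration applying to the upper limits.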
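The ratio-to-odds-ratio switch follows the chain of inequalities directly. In the sketch below (function names hypothetical) the ratio margin 0.9375 with $\pi_{S\_St} = 0.8$ maps to an odds-ratio margin of 0.75; both correspond to a difference margin of 0.05, so the three parameterisations describe the same hypotheses.

```python
def or0_from_r0(pi_st, r0):
    """OR margin equivalent to a ratio margin: R0 (1 - pi_St) / (1 - R0 pi_St)."""
    return r0 * (1 - pi_st) / (1 - r0 * pi_st)


def r0_from_or0(pi_st, or0):
    """Inverse conversion: OR0 / (1 - pi_St + OR0 pi_St)."""
    return or0 / (1 - pi_st + or0 * pi_st)


print(round(or0_from_r0(0.8, 0.9375), 4), round(r0_from_or0(0.8, 0.75), 4))
```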

4.A.2. Comparison of the sample sizes calculated using the models
Comparing the sample sizes calculated using the three models with α and 1-β fixed requires the success probability values, the non-inferiority margin, and the method used to estimate the variance under H0. Unfortunately, it is not possible to provide a universally valid rule for choosing the approach leading to the smallest sample size, but we have found a pattern of inequalities in non-inferiority margins that is valid asymptotically and over a clinically relevant interval (see below).
However, it is very easy to calculate sample sizes using a computer program that implements all of the formulae, and therefore to choose the most parsimonious approach. To this end, it is useful to plot sample size curves as a function of $\pi_{S\_Ex}$ at fixed values of statistical significance (α), power (1-β), $\pi_{S\_St}$ and non-inferiority margin, for the different methods of estimating the probabilities. Furthermore, using method 3 and $\delta_0 = 0.05$, the sample size calculated for the $R_S$ model is about 88% of that calculated for $D_S$ when $\pi_{S\_St} = 0.4$, 90% when $\pi_{S\_St} = 0.5$, and 96% when $\pi_{S\_St} = 0.9$. Finally, using method 3, the sample sizes for the $LR_S$ model are always smaller than those calculated for the $D_S$ model, and always slightly larger than those calculated for the $R_S$ model.
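As a small illustration of such a program, the sketch below compares the ratio and difference models at equipoise ($\pi_{S\_Ex} = \pi_{S\_St}$) with $\delta_0 = 0.05$. It relies on two simplifying assumptions of ours rather than on method 3: one common variance under H0 and HA, and the statistic $p_{S\_Ex} - R_{0\_S}\,p_{S\_St}$ for the ratio model. Under them both effect sizes equal $\delta_0$, the factor $(z_{1-\alpha/2}+z_{1-\beta})^2$ cancels, and the ratio of sample sizes reduces to $(1 + R_{0\_S}^2)/2$: about 0.88, 0.90 and 0.95 for $\pi_{S\_St}$ = 0.4, 0.5 and 0.9, close to, but not identical with, the method-3 percentages quoted above.

```python
def n_ratio_over_n_difference(pi_st, delta0, k=1.0):
    """n(R_S) / n(D_S) at equipoise under the common-variance simplification."""
    pi_ex = pi_st
    r0 = (pi_st - delta0) / pi_st
    v_d = pi_ex * (1 - pi_ex) + pi_st * (1 - pi_st) / k
    v_r = pi_ex * (1 - pi_ex) + r0 ** 2 * pi_st * (1 - pi_st) / k
    # Both effect sizes equal delta0: pi_Ex - pi_St + delta0 = pi_Ex - r0 * pi_St
    return v_r / v_d


for pi_st in (0.4, 0.5, 0.9):
    print(pi_st, round(n_ratio_over_n_difference(pi_st, 0.05), 3))
```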

B. Asymptotic behaviour study
When $\pi_{S\_Ex}$ under HA tends to its lower limit, which is $\pi_{S\_St} - \delta_0$ in the case of model 1 ($D_S$), or when the sample sizes tend to $+\infty$ at fixed k, non-inferiority margin and $\pi_{S\_St}$, the chains of inequalities derived in Appendix 3.A are valid.

C. Graphical comparisons of the sample sizes obtained using the models
Further results can be obtained from sample size curves for fixed values of the other parameters and $\pi_{S\_Ex}$ varying over the clinically relevant interval, whose limits are the extreme value of non-inferiority ($\pi_{S\_St} - \delta_0$) and $\pi_{S\_St}$, the latter corresponding to the equipoise condition.
We only show the sample size curves for $\pi_{S\_St}$ = 0.3, 0.5 and 0.7, with $\delta_0 = 0.15$ (because a large non-inferiority margin gives a better view of the sample size curves of the three models), α = 0.05 and 1-β = 0.80, using method 3. In the case of $\pi_{S\_St} = 0.3$, in addition to the fairly parallel pattern of the sample size curves, it can be seen that: (i) the sample sizes of the $D_S$ model are the largest, with those of the $LOR_S$ model becoming very similar (about 96%) to those of the $D_S$ model if $\pi_{S\_St} = 0.5$, and larger if $\pi_{S\_St} = 0.7$; (ii) the sample sizes of the $LOR_S$ model are always larger than those of the $R_S$ and $LR_S$ models; and (iii) the sample sizes of the $R_S$ and $D_S$ models become very similar at the highest values of $\pi_{S\_St}$.
Figures A.1, A.2 and A.3 show that, in addition to being asymptotically valid, the structure of the inequalities holds over a clinically relevant interval, with the equalities replaced by approximations.
In addition, the pattern of relationships remains substantially the same if the non-inferiority margins are changed; what changes is the magnitude of the differences in sample sizes. Finally, changing the method of estimation does not lead to any evident change in the relationships, except for the $R_S$ and $LR_S$ models, for which $n_{LRS\_Ex} < n_{RS\_Ex}$ with methods 1 and 2 and $n_{LRS\_Ex} > n_{RS\_Ex}$ with method 3, although the differences amount to only a few units.

SWITCHING FROM SUCCESS TO FAILURE PROBABILITIES, AND VICE VERSA
In order to enrol as few patients as possible with the most favourable parameterisation, it must be possible to consider the primary outcome of the trial as negative (failure), which can be done using the same approach and assumptions as those used when switching from one model to another.

Sample sizes for the different success and failure models
Comparing the sample size formulae for the models of success and failure, we have:

Model 1: D S and D F
It is possible to demonstrate that $n_{DS\_Ex} = n_{DF\_Ex}$ because, with $\pi_F = 1 - \pi_S$ and the same margin, the expressions in the numerators of the two sample size formulae coincide (each variance term $\pi(1-\pi)$ is unchanged when π is replaced by 1-π), and so do those in the denominators (the distance between the true difference and the non-inferiority margin is the same on both scales).

Model 2: R S , LR S , and R F , LR F
In the case of the models $R_S$ and $R_F$, the denominators of the sample size formulae are equal, but the numerators differ because $R_{0\_S} \le 1 \le R_{0\_F}$. Therefore, except in the non-sensible case $R_{0\_F} = R_{0\_S} = 1$, we have $n_{RS\_Ex} < n_{RF\_Ex}$, which means that a success-based approach is preferable.
In the case of the LR models, given that $n_{LRS\_Ex} \approx n_{RS\_Ex}$ and $n_{LRF\_Ex} \approx n_{RF\_Ex}$, it can be concluded that $n_{LRS\_Ex} < n_{LRF\_Ex}$.

Model 3: OR S (LOR S ) and OR F (LOR F )
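It is possible to demonstrate that $n_{LORF\_Ex} = n_{LORS\_Ex}$ because, by definition, the odds of failure are the reciprocal of the odds of success, so that $LOR_{T\_F} = -LOR_{T\_S}$ and $LOR_{0\_F} = -LOR_{0\_S}$, while the variance terms $1/[\pi(1-\pi)]$ are unchanged when π is replaced by 1-π; consequently, $n_{LORF\_Ex} = n_{LORS\_Ex}$.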
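The three success-versus-failure comparisons can be checked numerically with the example used throughout ($\pi_{S\_St} = \pi_{S\_Ex} = 0.65$, $\delta_{0\_S} = 0.075$, k = 1) and the corresponding failure probabilities $\pi_F = 1 - \pi_S$. The failure-model statistics below mirror the success ones by symmetry (margin +0.075 on the difference, $R_{0\_F} = (\pi_{F\_St} + \delta)/\pi_{F\_St}$ with statistic $R_{0\_F}\,p_{F\_St} - p_{F\_Ex}$, reciprocal odds for the OR); these forms, like the use of one common variance under H0 and HA, are our assumptions and are not taken from the supplementary material.

```python
import math

ZZ = (1.96 + 1.2816) ** 2  # (z_{1-alpha/2} + z_{1-beta})^2 as in the worked examples
K = 1.0                    # allocation ratio n_St / n_Ex


def logit(p):
    return math.log(p / (1 - p))


def n_from(v, effect):
    """Common-variance sample size: ZZ * v / effect^2 (assumed simplification)."""
    return ZZ * v / effect ** 2


pi_s, delta = 0.65, 0.075  # success probability in both arms, margin
pi_f = 1 - pi_s            # corresponding failure probability

# Model 1, difference: margin -delta for successes, +delta for failures
v_s = pi_s * (1 - pi_s) + pi_s * (1 - pi_s) / K
v_f = pi_f * (1 - pi_f) + pi_f * (1 - pi_f) / K
n_ds = n_from(v_s, (pi_s - pi_s) + delta)
n_df = n_from(v_f, delta - (pi_f - pi_f))

# Model 2, ratio: R0_S < 1 < R0_F; statistics p_Ex - R0_S p_St and R0_F p_St - p_Ex
r0_s = (pi_s - delta) / pi_s
r0_f = (pi_f + delta) / pi_f
n_rs = n_from(pi_s * (1 - pi_s) + r0_s ** 2 * pi_s * (1 - pi_s) / K, pi_s - r0_s * pi_s)
n_rf = n_from(pi_f * (1 - pi_f) + r0_f ** 2 * pi_f * (1 - pi_f) / K, r0_f * pi_f - pi_f)

# Model 3, log odds ratio: the failure margin is the reciprocal of the success one
v_lor_s = 1 / (pi_s * (1 - pi_s)) + 1 / (K * pi_s * (1 - pi_s))
v_lor_f = 1 / (pi_f * (1 - pi_f)) + 1 / (K * pi_f * (1 - pi_f))
n_lors = n_from(v_lor_s, 0.0 - (logit(pi_s - delta) - logit(pi_s)))
n_lorf = n_from(v_lor_f, (logit(pi_f + delta) - logit(pi_f)) - 0.0)

for name, n in [("D_S", n_ds), ("D_F", n_df), ("R_S", n_rs), ("R_F", n_rf),
                ("LOR_S", n_lors), ("LOR_F", n_lorf)]:
    print(name, math.ceil(n))
```

Under these assumptions the difference and log odds ratio models give identical sizes for successes and failures, while the ratio model is smaller on the success scale because $R_{0\_S}^2 < 1 < R_{0\_F}^2$ with the same denominator.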

CONCLUSIONS
In the case of models D and OR (LOR), the sample sizes are equal regardless of whether success or failure probabilities are considered; in the case of model R (and LR), smaller sample sizes are obtained using success probabilities.
Given the sample sizes calculated as a function of $\pi_{S\_Ex}$ varying over the clinically relevant interval, the following inequality chains apply (<* means that the difference is only a few units):
It has to be stressed that the above inequalities (with model $R_S$ the best, followed by $LR_S$) come from method 3 (constrained MLE), which performs better than method 1 of Blackwelder [35] or method 2 of Dunnett and Gent [3] in terms of type I error control, power and confidence interval coverage. Furthermore, the difference between these two models is practically eliminated using method 2, and reversed using method 1 (model $LR_S$ is the best, albeit by only a few units).
The table below shows the sample sizes for the four models:

DISCUSSION
Biomedical research has to be adequately powered by appropriate sample sizes for economic, organisational, logistic, scientific and, above all, ethical reasons (even though it is practically impossible to separate the ethical and scientific aspects of biomedical research). In addition, the feasibility of a trial depends mainly on the sample size that can be enrolled.
It is important that a research study be adequately powered with the smallest possible number of subjects, particularly in non-inferiority settings, in which it is easy to increase the non-inferiority margin in order to obtain a smaller sample size. However, a sample size that is too small in a non-inferiority setting not only fails to demonstrate that the experimental drug is non-inferior to the optimal standard treatment, but also fails to demonstrate that it is superior to a placebo or a previous standard, as it can be taken for granted that demonstrated non-inferiority of the experimental drug also implies its demonstrated superiority over a placebo or the previous standard.
The search for a statistical approach that leads to the most parsimonious but adequate sample size is particularly important when comparing two probabilities, for which different parameterisation models, testing procedures and sample size calculation formulae are available. We have presented the statistical models, the methods of estimating the variance under H_0, and the sample size calculation formulae separately for success and failure probabilities, and described a method of consistently switching among the models and probabilities in order to choose the most parsimonious approach. To this end, the coherence of the formulations is maintained by the general constraint that the non-inferiority margin of the final model is calculated using only the true probability of the Standard (π_St), which is independent of the non-inferiority margin and considered "known" when planning the study.
We have also demonstrated that, asymptotically, there is a hierarchical structure of inequalities among the sample sizes of the different models, and verified that it does not change under H_A within the range of clinically plausible values for non-inferiority settings.
We confirm that, in the case of success probabilities, the sample sizes of the R_S model are smaller, as previously shown by Laster et al. [49]. However, it has to be pointed out that the greater efficiency of the R_S model is not maintained in the case of failure probabilities, for which the sample size of the R_F model is greater than that of the D_F model, a result that is the opposite of that described by Laster et al. [49]. Laster et al. [49] obtained their result by reversing the order of the ratio between Experimental and Standard used in the case of success probability, in order to ensure that it remained less than 1. However, this reversal and a different formulation of H_0 do not lead to a single non-inferiority margin, because the margin depends on π_F_Ex, as shown in Appendix 1.2.
We have also shown that each success model has an equivalent model for failure. In the case of the D_S and D_F models, and of the LOR_S and LOR_F models, the sample sizes are the same, whereas the sample sizes of the R_S and LR_S models are always smaller than those of the R_F and LR_F models. It is thus possible to establish a hierarchical structure of sample sizes for the eight equivalent models within the clinically relevant interval of π_S_Ex (π_F_Ex) under H_A when all of the other parameters are fixed.
Furthermore, the odds ratio model leads to a larger sample size and, consequently, is not to be preferred even if an effect size or a non-inferiority margin expressed on the basis of this parameterisation might seem to be sensible.
The most sensible approach is to consider each case separately, calculating the sample size curves over the pertinent interval of clinical non-inferiority using the usual parameterisation in the particular clinical setting, and then choosing the model that leads to the most parsimonious sample size.
TABLE 1. Success Probability. Null Hypothesis (H_0) of the three considered Models (M), together with their sampling distribution and sample size calculation formulae for the experimental group.
Table 1.1 Success Probability. Sample sizes for α = 0.025 and 1 − β = 0.80

Legend: π_S_St = true success probability for the Standard drug; R_0_S = non-inferiority margin expressed in the ratio scale; M = Method 1, 2, and 3 (see text); R_T_S = true ratio between the true success probability for the Experimental drug (π_S_Ex) and π_S_St; D_S = Difference; R_S = Ratio; LR_S = ln(R_S); LOR_S = ln(Odds Ratio). The "-" sign means that the case is incompatible with non-inferiority, and the "." sign means that the denominator of the sample size formula is equal to 0 (the sample size tends to infinity).
Table 2. Success Probability. Formulae for switching from one model to another of the three considered models.

APPENDIX 1
The term θ = n_2/n_1 of Farrington and Manning's formula 4 [38] is replaced by ϕ = n/m = n_2/n_1, where n_1 is the sample size of the Standard group and n_2 the sample size of the Experimental group; in addition, in Farrington and Manning's notation [38], p_1 corresponds to π_1 (π_St in this paper) for the Standard, and p_2 to π_2 (π_Ex in this paper) for the Experimental. Finally, the maximum likelihood estimates in the formulae are indicated by a bar [44,45] or a tilde [38]. Eliminating θ from the denominator of the second terms under the square roots in Farrington and Manning's formula [38] means that the corresponding ϕ becomes the multiplier of the first terms of the square roots in the numerator and the denominator of Machin et al.'s formula.
When the denominators of the two formulae are equal, as can be expected in the case of success probabilities, the numerators are equal and the sample sizes for the Standard drug are the same (n_1 or m); but, when the calculation is for an unbalanced allocation with θ ≠ 1 or ϕ ≠ 1, the results differ. This is because Machin et al.'s formula [44,45] for calculating the coefficients of the cubic equation that gives the maximum likelihood estimates wrongly uses the reciprocal of ϕ (defined as n/m = n_2/n_1), as can be seen from the value of the "b" coefficient in Farrington and Manning's equation (b_FM) [38]: On the contrary, the b coefficient in Machin et al.'s formula (b_M) [45,46], with ϕ = n_2/n_1, is: Furthermore, and even more clearly, the "a" coefficients (a_FM and a_M) are: In conclusion, in the case of an unequal allocation and in order to obtain the same results, Machin et al.'s formula [44,45] has to be used either with maximum likelihood estimates calculated according to Farrington and Manning [38] or with the reciprocal of ϕ (ϕ = n_1/n_2).
Indeed, in Machin et al.'s equations 5.4 and 5.5 [44, page 101] and equations 9.10 and 9.11 [45, page 109], what needs to be multiplied by the sample size ratio (ϕ) is the standard error of the maximum likelihood estimate of the experimental probability of success (π̄_2 and π_2), and not that of the standard probability of success (π̄_1 and π_1); finally, ϕ has to be deleted from the denominator. However, the above formulae have been corrected in the latest (4th) edition of Machin et al.'s book [46].
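The cubic equation discussed above can be solved in closed form. The Python sketch below derives the coefficients from the score equation under the constraint p_2 = p_1 − δ_0 (p_1 = Standard, p_2 = Experimental, θ = n_2/n_1) and, as far as we can tell, they coincide with those given by Farrington and Manning [38]; the function name and the clamping of the arccos argument are our own choices, not the authors' code. Because θ enters a, b and c, passing n_1/n_2 instead of n_2/n_1 changes the estimates whenever θ ≠ 1, which is the kind of discrepancy described above.

```python
import math


def fm_restricted_mle(p1_hat, p2_hat, delta0, theta=1.0):
    """Restricted MLEs (p1_tilde, p2_tilde) under H0: p1 - p2 = delta0.

    p1 = Standard, p2 = Experimental, theta = n2 / n1. The cubic in x = p1
    comes from the score equation with p2 = p1 - delta0; its middle real root
    is obtained with the trigonometric (Cardano) solution.
    """
    a = 1 + theta
    b = -(1 + theta + p1_hat + theta * p2_hat + delta0 * (theta + 2))
    c = delta0 ** 2 + delta0 * (2 * p1_hat + theta + 1) + p1_hat + theta * p2_hat
    d = -p1_hat * delta0 * (1 + delta0)
    v = b ** 3 / (3 * a) ** 3 - b * c / (6 * a ** 2) + d / (2 * a)
    u = math.copysign(math.sqrt(b ** 2 / (3 * a) ** 2 - c / (3 * a)), v)
    cos_arg = max(-1.0, min(1.0, v / u ** 3))   # guard against rounding just outside [-1, 1]
    w = (math.pi + math.acos(cos_arg)) / 3
    p1 = 2 * u * math.cos(w) - b / (3 * a)
    return p1, p1 - delta0


# delta0 = 0 reduces to the pooled proportion (0.5 + 0.3) / 2, i.e. about (0.4, 0.4)
print(fm_restricted_mle(0.5, 0.3, 0.0))
# unequal allocation: theta = n2/n1 = 2 versus its reciprocal 0.5 give different estimates
print(fm_restricted_mle(0.6, 0.55, 0.10, theta=2.0))
print(fm_restricted_mle(0.6, 0.55, 0.10, theta=0.5))
```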

Laster et al.'s approach to failure probabilities
Laster et al. [49] calculated the non-inferiority margin of the difference between two failure probabilities from the non-inferiority margin of the relative risk (and vice versa) by exchanging the roles of π_F_Ex and π_F_St and defining R_T_F as π_F_St/π_F_Ex. The H_0 and H_A of the failure probabilities are therefore formally equal to those of the success probabilities and, consequently, the same sample size formulae as those used for the success probability have to be used. The non-inferiority margin defined by Laster et al. [49] is:
$$\delta^{*}_{0\_F} = (1 - R^{*}_{0\_F})\,\pi_{F\_Ex}$$
a formula that corresponds to the one we use in the case of success probabilities, with π_S_St replaced by π_S_Ex. The asterisks indicate that these quantities differ from those used elsewhere in this paper and pertain only to this demonstration.
It should be noted that R*_0_F depends on π_F_Ex which, unlike π_F_St or π_S_St, does not have a single well-defined value under H_0 and H_A: the values of π_F_Ex (or π_S_Ex) under H_0 depend on the non-inferiority margin and, under H_A, on the true (optimally zero) difference between the standard and experimental probabilities. This leads to different sample sizes and powers.
However, it is possible to obtain a non-inferiority margin for differences (model 1) that depends only on the known values of π_F_St (or π_S_St), which are the same under H_0 and H_A.
From the H_0 of the ratio between two probabilities (model 2.1, appropriately called "relative risk" in the case of failure probabilities), we have the following chain of inequalities:
$$\frac{\pi_{F\_Ex}}{\pi_{F\_St}} \ge R_{0\_F} \iff \pi_{F\_Ex} \ge R_{0\_F}\,\pi_{F\_St} \iff \pi_{F\_Ex} - \pi_{F\_St} \ge (R_{0\_F} - 1)\,\pi_{F\_St} = \delta_{0\_F}$$
This non-inferiority margin corresponds to that shown in Table App.2.3 (Failure Probability) of the supplementary material: as R*_0_F = 1/R_0_F, straightforward algebra shows that δ_0_F = (R_0_F − 1)π_F_St (second row, second column, second formula).
In addition, under H_A, this margin is always larger than that of Laster et al. [50], thus leading to smaller sample sizes. For example, following Laster et al. [50], with α = 0.05, 1 − β = 0.8, an equal allocation in the two groups and π_F_St = 0.1, setting R*_0_F = 1/R_0_F gives a result that corresponds to the value shown between brackets in the first row of the corresponding table; on the basis of our conversion formula, however, we obtain a larger margin. In the case of failure probability, Laster et al. change the definition and approach they adopted for the success probability [50, page 1116], for which they stated that the non-inferiority margin is "a high percentage or fraction (R_LB) of π_St (R_LB < 1)". However, this seems inconsistent, insofar as R_LB becomes a high percentage or fraction of π_Ex (π_F_Ex, in our notation) and, consequently, the non-inferiority margin δ_0_F of model 1 (D) depends not on π_F_St but on π_F_Ex.
It is also necessary to consider that if, under H_A, π_F_St = π_F_Ex (as is very sensible), Laster et al.'s margin δ*_0_F takes the same value whether it is computed from π_F_Ex or from π_F_St. Consequently, it does not seem consistent that the maximum ratio different from 1 under H_A (R_T = 0.8) and the maximum non-inferiority margin in terms of a ratio (R_LB = 0.5) both translate into the very small difference of 0.0625; it would seem more reasonable to obtain our larger difference of 0.1.
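The two margins compared above can be reproduced with a few lines of Python. The sketch assumes, as described in this appendix, that Laster et al.'s margin has the form δ*_0_F = (1 − R*_0_F)π_F_Ex, that R*_0_F = R_LB = 0.5 = 1/R_0_F, and that π_F_Ex = π_F_St/R_T with R_T = 0.8; the function names are hypothetical.

```python
def delta0_ours(r0_f, p_f_st):
    """Difference margin implied by the failure ratio margin: (R_0_F - 1) * pi_F_St."""
    return (r0_f - 1) * p_f_st


def delta0_laster(r_star, p_f_ex):
    """Assumed form of Laster et al.'s margin: (1 - R*_0_F) * pi_F_Ex."""
    return (1 - r_star) * p_f_ex


p_f_st = 0.10           # failure probability of the Standard
r_lb = 0.50             # R*_0_F = R_LB = 1 / R_0_F
r_t = 0.80              # R_T = pi_F_St / pi_F_Ex under H_A
p_f_ex = p_f_st / r_t   # about 0.125

print(delta0_laster(r_lb, p_f_ex))    # about 0.0625
print(delta0_ours(1 / r_lb, p_f_st))  # about 0.1
```

Our margin depends only on π_F_St, whereas the assumed Laster et al. margin moves with whatever value is taken for π_F_Ex.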
Finally, using our approach, it is possible to show that applying the values of the non-inferiority margin obtained directly from a success model to the failure model, or vice versa, is consistent. This view is also indirectly supported when switching from success model 1 to success model 2.1, then to failure model 2.1 and, finally, to failure model 1. This consistency cannot be demonstrated using Laster et al.'s approach [49], because the settings of success and failure are kept separate.

Formulation of the H0 and HA hypotheses
The general methodology is the same as that used for the success probability, except for the formulation of the H_0 and H_A hypotheses, in which the direction of the inequalities is reversed. Using the subscript "F" for failure, H_0_F expresses inferiority and H_A_F non-inferiority. What follows are the differences from the results obtained in the case of success probability.

Statistical significance test
Given the above H_0 formulation, the non-inferiority statistical significance test will always be one-sided (on the left), with the test function given by: (2.1.2). With t_F as the sampling value of T_F and a significance level of α = 0.05 two-sided (or, equivalently, 0.025 one-sided), H_0 will be rejected if z < z_{α/2} or t_F < t_c, where t_c is the quantile that delimits the critical region. The rejection of the null hypothesis indicates the non-inferiority of the Experimental; alternatively, using the usual confidence-interval approach, the non-inferiority H_0 hypothesis is rejected if the upper limit of the 95% confidence interval is lower than the (positive) non-inferiority margin.
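A minimal Python sketch of the left-sided test for model 1 (D_F) follows. It assumes the unrestricted (method 1, Wald) standard error, so that t_F < z_{α/2} and "upper limit of the two-sided (1 − α) confidence interval below δ_0_F" are algebraically the same decision; the restricted-variance methods would change only how the standard error is computed. The function name and the example counts are illustrative.

```python
import math
from statistics import NormalDist


def ni_test_failure_difference(x_ex, n_ex, x_st, n_st, delta0_f, alpha=0.05):
    """Left-sided non-inferiority test for D_F = pi_F_Ex - pi_F_St (Wald form).

    Hypothetical helper using the unrestricted (method 1) standard error.
    Returns the test statistic, the decision from the test and the decision
    from the two-sided (1 - alpha) confidence interval.
    """
    p_ex, p_st = x_ex / n_ex, x_st / n_st
    d_hat = p_ex - p_st
    se = math.sqrt(p_ex * (1 - p_ex) / n_ex + p_st * (1 - p_st) / n_st)
    t_f = (d_hat - delta0_f) / se
    z_low = NormalDist().inv_cdf(alpha / 2)     # z_{alpha/2}, about -1.96
    upper = d_hat - z_low * se                  # upper limit of the confidence interval
    return t_f, t_f < z_low, upper < delta0_f


# 300 patients per arm, 50% failures in both: non-inferiority shown with delta0_F = 0.125
print(ni_test_failure_difference(150, 300, 150, 300, 0.125))
# 100 patients per arm, same proportions: not enough information
print(ni_test_failure_difference(50, 100, 50, 100, 0.125))
```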

Sample size calculation
The rationale underlying the sample size calculation is based on the simultaneous occurrence of two events: obtaining a statistically significant result (under H_0) and the rejection of H_0 under H_A. Solving the above inequalities for t_c gives: Finally, by equating the above expressions to t_c, the sample size can be calculated using the following general pivotal formula: (2.1.3), which has to be explicitly solved for the sample size (n_F_Ex) of the Experimental.
The general formula for calculating the sample size of the Experimental is obtained using the algebra shown in paragraph 2.A.3: (2.1.3.1). The above formula has to be appropriately adapted to the parameters of the considered parameterisations and, in order to allow an unequal allocation, the ratio of the two sample sizes (k = n_F_St/n_F_Ex) has to be included. The difference in the denominator is inverted but, as it is squared, the result is the same as that obtained from Equation 2.A.
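As a sketch of how the allocation ratio enters the pivotal formula, the code below assumes the usual form n_F_Ex = (z_{1−α/2}σ_H0 + z_{1−β}σ_HA)²/(µ_HA − µ_H0)², with per-Experimental-subject standard deviations that include k = n_F_St/n_F_Ex. It is applied to model 1 (D_F) with method-1 variances (σ_H0 = σ_HA); the helper names are hypothetical.

```python
from statistics import NormalDist


def n_experimental(mu_h0, mu_ha, sd_h0, sd_ha, alpha=0.05, power=0.80):
    """n_Ex = (z_{1-alpha/2} * sd_h0 + z_{1-beta} * sd_ha)^2 / (mu_ha - mu_h0)^2.

    sd_h0 and sd_ha are per-Experimental-subject standard deviations of the
    model's estimator under H_0 and H_A (allocation ratio k already included);
    alpha is two-sided, as in the text.
    """
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    return (z_a * sd_h0 + z_b * sd_ha) ** 2 / (mu_ha - mu_h0) ** 2


def sd_difference(p_ex, p_st, k):
    """Per-subject SD of pi_F_Ex_hat - pi_F_St_hat when n_St = k * n_Ex."""
    return (p_ex * (1 - p_ex) + p_st * (1 - p_st) / k) ** 0.5


# D_F model, method 1: pi_F_Ex = pi_F_St = 0.5, H_0 boundary mean delta0_F = 0.125
for k in (1.0, 0.5):
    sd = sd_difference(0.5, 0.5, k)
    n_ex = n_experimental(mu_h0=0.125, mu_ha=0.0, sd_h0=sd, sd_ha=sd)
    print(f"k = {k}: n_F_Ex = {n_ex:.1f}, n_F_St = {k * n_ex:.1f}")
```

With these inputs the sketch gives about 251 per group for k = 1, and about 377 Experimental plus 188 Standard subjects for k = 0.5, so the imbalance increases the total.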

Models for the Comparison of two Failure proportions
As there are a number of overlaps with the theoretical results shown in section 3A, we shall only consider the differences. The failure probabilities of the Standard and Experimental are indicated as π_F_St and π_F_Ex, respectively and, as shown above, the observed numbers of failures are binomially distributed.

Null (H0) and alternative (HA) hypotheses.
Although we maintain the convention of writing the failure probability of the Experimental and Standard in that order, it must be remembered that the inequalities are different. For example, in the case of model 1 (D_F), they are:
$$H_{0\_F}: \pi_{F\_Ex} - \pi_{F\_St} \ge \delta_{0\_F} \qquad H_{A\_F}: \pi_{F\_Ex} - \pi_{F\_St} < \delta_{0\_F}$$
The non-inferiority margin δ_0_F is a value that is appropriate in this setting, and is generally fixed at a suitable fraction of π_F_St. The other non-inferiority margins also have different limits: R_0_F > 1 in the case of model 2.1 (R_F), LR_0_F > 0 in the case of model 2.2 (LR_F), and OR_0_F > 1 in the case of model 3 (OR_F), which is only considered as LOR_0_F > 0.

Sampling distribution
As in the case of success probabilities, it is possible to formulate the null and alternative hypotheses, determine the sampling distributions, and derive the formulae for power and sample size for each of the three models. The sampling distributions of the failure models are the same as those shown in part 3.A, except that model 1 (D_F) has −δ_0_F instead of +δ_0_F.

Statistical testing
In accordance with the null hypothesis, the statistical tests are one-sided on the left tail of the distribution (instead of being on the right tail as in the case of successes).

Sample size calculation
The formulae for the sample size calculation shown in Table App.2.1 are the same as those obtained in the case of successes, except for the difference model (D_F), which has −δ_0_F instead of +δ_0_S.

Power calculation
Once the sample size has been established as described above, it is again possible to calculate the power, either by deriving an ad hoc formula as shown in the case of success, or by solving the sample size formula for z_{1−β} and then calculating the corresponding probability.
The sample sizes in Table App.2.2 are calculated for R_T_F values of 1.18 = 1.00/0.85, 1.11 = 1.00/0.90, 1.05 = 1.00/0.95 and 1.00, and for π_F_St values ranging from 0.1 to 0.9 at intervals of 0.2, assuming α = 0.025, 1 − β = 0.80 and non-inferiority margins (expressed as R_0_F) of 1.25, 1.15 and 1.05 (which are considered suitable for non-inferiority studies), using the three methods of estimating the probabilities. For example, with π_F_St = 0.50 and π_F_Ex = 0.50 (giving R_T_F = 1.00), δ_0_F = 0.125, α = 0.025 and 1 − β = 0.80, it is first necessary to calculate R_0_F = 1 + δ_0_F/π_F_St = 1 + 0.125/0.5 = 1.25. The values n_DF_Ex = 251, n_RF_Ex = 322, n_LRF_Ex = 315 and n_LORF_Ex = 241 can then be read in the row with π_F_St = 0.5, R_0_F = 1.25 and M = 1 (method 1), in the columns corresponding to R_T_F = 1.00. The subsequent two rows show the sample sizes calculated using methods 2 and 3; using method 3, n_RF_Ex = 319 < n_LRF_Ex = 326. These sample sizes become 809, 1,038, 1,018 and 763 when π_F_Ex = 0.555, giving R_T_F = 1.11.
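The method-1 figures quoted in the example can be checked with the following Python sketch. It assumes that method 1 evaluates a single variance at the H_A probabilities (no restricted estimation), that k = 1, that α = 0.025 is one-sided, and that all four margins are derived from δ_0_F through π_F_St; function and model labels are our own. Under these assumptions, rounding to the nearest integer gives 251, 322, 315 and 241 for π_F_Ex = 0.50, and 809, 1,038, 1,018 and 763 when π_F_Ex is taken as 0.5/0.90 (R_T_F = 1.00/0.90), matching the values quoted above; the rounding convention used for the tables themselves is not stated here.

```python
import math
from statistics import NormalDist


def logit(p):
    return math.log(p / (1 - p))


def n_failure_method1(model, p_ex, p_st, delta0_f, alpha=0.025, power=0.80):
    """Experimental-group sample size (k = 1) for the failure models, method 1.

    All margins are derived from delta0_F through pi_F_St, so the H_0 boundary
    of the Experimental is p0 = pi_F_St + delta0_F in every model.
    """
    p0 = p_st + delta0_f
    if model == "D":
        gap = p_ex - p0
        var = p_ex * (1 - p_ex) + p_st * (1 - p_st)
    elif model == "R":
        r0 = p0 / p_st                        # R_0_F = 1 + delta0_F / pi_F_St
        gap = p_ex - r0 * p_st
        var = p_ex * (1 - p_ex) + r0 ** 2 * p_st * (1 - p_st)
    elif model == "LR":
        gap = math.log(p_ex) - math.log(p0)   # ln(R_T_F) - LR_0_F
        var = (1 - p_ex) / p_ex + (1 - p_st) / p_st
    elif model == "LOR":
        gap = logit(p_ex) - logit(p0)         # ln(OR_T_F) - LOR_0_F
        var = 1 / (p_ex * (1 - p_ex)) + 1 / (p_st * (1 - p_st))
    else:
        raise ValueError(f"unknown model: {model}")
    z = NormalDist().inv_cdf(1 - alpha) + NormalDist().inv_cdf(power)
    return z ** 2 * var / gap ** 2


models = ("D", "R", "LR", "LOR")
for p_ex in (0.50, 0.50 / 0.90):              # R_T_F = 1.00 and 1.00 / 0.90
    sizes = [round(n_failure_method1(m, p_ex, 0.50, 0.125)) for m in models]
    print(f"pi_F_Ex = {p_ex:.3f}: {dict(zip(models, sizes))}")
```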

Switching non-inferiority margins from one model to another
It is also possible to calculate the pertinent switching formulae for the failure probability (see Table App.2.3), following the same theoretical approach as that used in the case of success probability and starting from their different null hypotheses (H_0); once again, it is the standard probability (π_F_St) that plays a pivotal role.
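It is worth pointing out that, obtaining the non-inferiority margin of model 2.1 from model 1, we have:
$$R_{0\_S} = 1 - \frac{\delta_{0\_S}}{\pi_{S\_St}}$$
for the success probability, and
$$R_{0\_F} = 1 + \frac{\delta_{0\_F}}{\pi_{F\_St}}$$
for the failure probability.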
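The formulae for model 1 converted from model 3 are:
$$\delta_{0\_S} = \pi_{S\_St} - \frac{OR_{0\_S}\,\pi_{S\_St}}{1 - \pi_{S\_St} + OR_{0\_S}\,\pi_{S\_St}}$$
for the success probability, and
$$\delta_{0\_F} = \frac{OR_{0\_F}\,\pi_{F\_St}}{1 - \pi_{F\_St} + OR_{0\_F}\,\pi_{F\_St}} - \pi_{F\_St}$$
for the failure probability. The same considerations apply in the case of switching from model 1 to model 2.2. Switching from model 2.1 to model 2.2 only requires replacing R_0_F by LR_0_F = ln(R_0_F). In addition, it is possible to switch from model 1 to model 2.2 (LR_F) and vice versa by using:
$$LR_{0\_F} = \ln\left(1 + \frac{\delta_{0\_F}}{\pi_{F\_St}}\right), \qquad \delta_{0\_F} = \pi_{F\_St}\left(e^{LR_{0\_F}} - 1\right)$$
The switch from the OR_F to the LOR_F parameterisation (LOR_0_F = ln(OR_0_F)) needs no explanation.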
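The switching rules can be collected in a small Python sketch. It assumes that each margin is defined through the Experimental probability on the H_0 boundary (π_St − δ_0 for successes, π_St + δ_0 for failures), so that R_0 = 1 ∓ δ_0/π_St, LR_0 = ln R_0 and OR_0 is the corresponding odds ratio; the function names are hypothetical.

```python
import math


def margins_from_delta(delta0, p_st, outcome):
    """Margins of models 2.1, 2.2 and 3 equivalent to the difference margin delta0.

    Hypothetical helper. For "success" the H_0 boundary of the Experimental is
    pi_St - delta0; for "failure" it is pi_St + delta0.
    """
    if outcome not in ("success", "failure"):
        raise ValueError(f"unknown outcome: {outcome}")
    p0 = p_st - delta0 if outcome == "success" else p_st + delta0
    r0 = p0 / p_st
    or0 = (p0 / (1 - p0)) / (p_st / (1 - p_st))
    return {"R0": r0, "LR0": math.log(r0), "OR0": or0, "LOR0": math.log(or0)}


def delta_from_or(or0, p_st, outcome):
    """Inverse switch (model 3 to model 1): difference margin implied by OR_0."""
    p0 = or0 * p_st / (1 - p_st + or0 * p_st)
    return p_st - p0 if outcome == "success" else p0 - p_st


print(margins_from_delta(0.125, 0.5, "failure"))   # R0 = 1.25, OR0 about 1.667
print(margins_from_delta(0.125, 0.5, "success"))   # R0 = 0.75, OR0 about 0.6
print(delta_from_or(5 / 3, 0.5, "failure"))        # back to about 0.125
```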

Models 2.1 (ratio: R_F) and 2.2 (ln(ratio): LR_F) vs model 3 (odds ratio: OR_F, and ln(OR_F): LOR_F)
In the case of π_F_St = 0.3, in addition to the fairly parallel pattern of the sample size curves, it can be seen that: i) the sample sizes of the D_F model are the smallest, with those of the LOR_F model becoming very similar to (about 96% of) those of the D_F model when π_F_St = 0.5, and even smaller when π_F_St = 0.7; and ii) the sample sizes of the LOR_F model are always smaller than those of the R_F and LR_F models. In addition, the pattern of relationships remains substantially the same if the non-inferiority margins are changed; what changes is the magnitude of the differences in sample size.
Finally, changing the estimation method does not lead to any evident change in the relationships, except in the case of the R_F and LR_F models, which give sample sizes that differ by only a few units using method 3, are practically equal using method 2, and reverse their relationship (n_LRF_Ex < n_RF_Ex) using method 1, although, once again, with differences of only a few units.
Including these quantities in the general sample size formula of the LR model, the conclusion is that n_LRS_Ex is asymptotically equal to n_RS_Ex and, in practical terms, the difference is only a few units.
Epidemiology Biostatistics and Public Health - 2020, Volume 17, Number 1

3.A.3. Asymptotic behaviour of the ratio between the sample sizes of the models
A general formula for the sample size calculation of a generic "model T" is:
$$n_T = \frac{\left(z_{1-\alpha/2}\,\sigma_{T\_H_0} + z_{1-\beta}\,\sigma_{T\_H_A}\right)^2}{\Delta_T^2}$$
with Δ_T = µ_T_HA − µ_T_H0 and σ_T_H0, σ_T_HA the per-subject standard deviations of the estimator of model T under H_0 and H_A; the ratio between the sample sizes of two generic models (T1 and T2) is:
$$\frac{n_{T1}}{n_{T2}} = \left(\frac{z_{1-\alpha/2}\,\sigma_{T1\_H_0} + z_{1-\beta}\,\sigma_{T1\_H_A}}{z_{1-\alpha/2}\,\sigma_{T2\_H_0} + z_{1-\beta}\,\sigma_{T2\_H_A}}\right)^2 \frac{\Delta_{T2}^2}{\Delta_{T1}^2}$$
The asymptotic behaviour of this ratio can be obtained for H_A → H_0 (µ_T1_HA and µ_T2_HA tend to their non-inferiority limits µ_T1_H0 and µ_T2_H0, respectively), and consequently for Δ_T1 → 0 and Δ_T2 → 0 (hereafter Δ_T → 0). The same approach can also be applied to π_S_Ex_HA, which tends to the non-inferiority limit π_S_Ex_H0, and to π_S_St_HA → π_S_St_H0. Section 2.A.3.1 gives an example of the application of this formula to the models LR_S (T1) and LOR_S (T2).

3.A.3.1. Calculus of the limits of the ratio between the sample sizes of models LR_S and LOR_S
3.A.3.1.1. Calculus of the limit
As the numerator is always less than the denominator, the ratio is < 1.
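The limiting behaviour can also be checked numerically. The Python sketch below uses method-1 variances and k = 1 for the success models, moves π_S_Ex towards its H_0 boundary π_S_St − δ_0_S, and prints the ratios n_LR/n_R and n_LR/n_LOR; the helper is hypothetical. With π_S_St = 0.5 and δ_0_S = 0.125, the first ratio approaches 1 and the second approaches about 0.83, in line with the conclusions that n_LRS_Ex is asymptotically equal to n_RS_Ex and that the LR_S/LOR_S ratio is below 1.

```python
import math


def n_over_z2(model, p_ex, p_st, delta0_s):
    """Success-model sample size divided by (z_{1-alpha/2} + z_{1-beta})^2.

    Method 1, k = 1; the common z factor cancels in any ratio between models,
    which is why these limits do not depend on alpha or beta.
    """
    p0 = p_st - delta0_s                          # H_0 boundary of the Experimental
    if model == "R":
        r0 = p0 / p_st
        gap = p_ex - r0 * p_st
        var = p_ex * (1 - p_ex) + r0 ** 2 * p_st * (1 - p_st)
    elif model == "LR":
        gap = math.log(p_ex / p0)
        var = (1 - p_ex) / p_ex + (1 - p_st) / p_st
    elif model == "LOR":
        gap = math.log(p_ex / (1 - p_ex)) - math.log(p0 / (1 - p0))
        var = 1 / (p_ex * (1 - p_ex)) + 1 / (p_st * (1 - p_st))
    else:
        raise ValueError(f"unknown model: {model}")
    return var / gap ** 2


p_st, delta0 = 0.50, 0.125
for eps in (1e-2, 1e-4, 1e-6):                    # H_A approaching H_0
    p_ex = p_st - delta0 + eps
    lr_r = n_over_z2("LR", p_ex, p_st, delta0) / n_over_z2("R", p_ex, p_st, delta0)
    lr_lor = n_over_z2("LR", p_ex, p_st, delta0) / n_over_z2("LOR", p_ex, p_st, delta0)
    print(f"eps = {eps:g}: n_LR/n_R = {lr_r:.5f}, n_LR/n_LOR = {lr_lor:.5f}")
```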

Conclusions
The ratio limits do not depend on α or β, but only on π_F_St, k and the non-inferiority margin. Then, having fixed k, the non-inferiority margin and π_F_St, the following relations apply: L_DF/RF > 1, L_RF/LRF = 1 and L_LRF/LORF < 1, from which it is possible to obtain those relating to the sample sizes:
1) n_DF_Ex < n_RF_Ex = n_LRF_Ex and n_LORF_Ex < n_RF_Ex = n_LRF_Ex, regardless of the value of π_F_St;
2) n_DF_Ex ≤ n_LORF_Ex < n_RF_Ex = n_LRF_Ex when π_F_St ≤ (1 − δ_0)/2, or n_LORF_Ex < n_DF_Ex < n_RF_Ex < n_LRF_Ex when π_F_St > (1 − δ_0)/2.
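The threshold (1 − δ_0)/2 can be illustrated with a short Python sketch. It assumes that, as H_A approaches H_0, both the D_F and LOR_F sample sizes are dominated by the same factor (z_{1−α/2} + z_{1−β})²/(π_F_Ex − π_0)², with π_0 = π_F_St + δ_0 and k = 1, so that their ratio tends to a ratio of variance terms; the helper is hypothetical. Since p(1 − p) is symmetric about 0.5, the limit is below 1 exactly when π_F_St(1 − π_F_St) < π_0(1 − π_0), that is, when π_F_St < (1 − δ_0)/2.

```python
def limit_ratio_d_over_lor(p_st, delta0):
    """Limit of n_DF_Ex / n_LORF_Ex as H_A approaches H_0 (failure, k = 1).

    Near the boundary p0 = pi_F_St + delta0, the LOR gap behaves like
    (pi_F_Ex - p0) / (p0 (1 - p0)), so both sample sizes share the factor
    (z_{1-alpha/2} + z_{1-beta})^2 / (pi_F_Ex - p0)^2 and the limit is a
    ratio of variance terms only.
    """
    def f(p):
        return p * (1 - p)

    p0 = p_st + delta0
    return (f(p0) + f(p_st)) / (f(p0) + f(p0) ** 2 / f(p_st))


delta0 = 0.15
threshold = (1 - delta0) / 2                      # 0.425
for p_st in (0.3, threshold, 0.5, 0.7):
    ratio = limit_ratio_d_over_lor(p_st, delta0)
    print(f"pi_F_St = {p_st:.3f}: n_DF/n_LORF -> {ratio:.3f}")
```

With δ_0 = 0.15 the printed limits are below 1 at π_F_St = 0.3, equal to 1 at 0.425, and above 1 at 0.5 and 0.7.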

The expected values (µ) of the estimator under H_0 and H_A are then defined, and applying formula (2.A.3) at an α/2 significance level and (1 − β) power gives formula (3.A.1.4). The denominator of the second term is due to the expected value of D_S under H_0 (3.A.1.4.Bis). A further simplification is obtained when k = 1 and π_S_Ex = π_S_St, because the denominator is then given by δ_0_S squared. The pertinent probability estimates for the sample size calculation have to be entered in the formula. Formula (3.A.1.4) allows different expected success proportions for the Experimental and Standard, and both Formulae (3.A.1.4) and (3.A.1.4.Bis) allow different sample sizes for any k, although usually k < 1 in clinical research, as the imbalance is due to randomising more patients to the Experimental in order to obtain more precise estimates.
In accordance with the second formulation of the above H_0 and H_A, we consider the corresponding sampling distribution, and H_0 is rejected at an α/2 level of significance according to formula (3.A.2.1.3).
3.A.2.1.4. Sample size calculation
Let us indicate the sample sizes of the two treatment groups as n_RS_Ex and n_RS_St (3.A.2.1.4.Bis). In the case of model 2.2 (LR_S), H_0 and H_A are reformulated accordingly and, from the general formula 2.A.3, we obtain formula (3.A.2.2.4). Again, when π_S_Ex = π_S_St under H_A, it is straightforward to obtain the simpler sample size calculation formula.
Example 3.A.2.2.4 (the estimates are obtained using Farrington and Manning's method 3 [38]).

Figures 5.3.1.1 and 5.3.1.2 show the sample size curves relating to the example above, which confirm the inequality chains for the success and failure probabilities separately.

1.1. Machin et al.'s formula
Machin et al.'s formula for calculating sample sizes for non-inferiority studies of the difference between two proportions (5.4, Chapter 4, page 101 [44] and 9.10, Chapter 9, page 109 [45]) needs to be corrected in the case of an unequal allocation (ϕ ≠ 1). Except for the absolute value of the difference π_1 − π_2 in the denominator, the formula corresponds to Farrington and Manning's formula 4 [38, page 1449].

Figure 2.B.4.1: π_F_St = 0.3. Sample size curves for π_F_St = 0.3 with δ_0 = 0.15 (because a large non-inferiority margin gives a clearer view of the sample size curves of the three models), for α = 0.05 and 1 − β = 0.80, as a function of π_F_Ex (ranging from 0.30 to 0.45), using method 3. The curves are for the Difference (D), Ratio (R), Logarithm of the ratio (LR), and Logarithm of the Odds Ratio (LOR).
Table App.2.1 shows the H_0_F hypotheses, sampling distributions, and sample size calculation formulae of the three models, with the second model divided into R_F (model 2.1) and LR_F (model 2.2), and the third model considering OR_F and LOR_F together. The sample size calculation formulae are numbered 3.B.1.4, 3.B.2.1.4, 3.B.2.2.4, and 3.B.3.4 to match the corresponding formulae for the success probability. Table App.2.2 shows the sample sizes calculated for some values of R_T_F and π_F_St.
Considering LR_S as a function of the two variables π_S_Ex and π_S_St, and applying a first-degree Taylor series expansion around π_S_Ex_H0 and π_S_St_H0, we obtain: