Parametric and nonparametric two-sample tests for feature screening in class comparison: a simulation study

Background: The identification of a location-, scaleand shape-sensitive test to detect differentially expressed features between two comparison groups represents a key point in high dimensional studies. The most commonly used tests refer to differences in location, but general distributional discrepancies might be important to reveal differential biological processes. Methods: A simulation study was conducted to compare the performance of a set of two-sample tests, i.e. Student's t, Welch's t, Wilcoxon-Mann-Whitney (WMW), Podgor-Gastwirth PG2, Cucconi, Kolmogorov-Smirnov (KS), Cramervon Mises (CvM), Anderson-Darling (AD) and Zhang tests (ZK, ZC and ZA) which were investigated under different distributional patterns. We applied the same tests to a real data example. Results: AD, CvM, ZA and ZC tests proved to be the most sensitive tests in mixture distribution patterns, while still maintaining a high power in normal distribution patterns. At best, the AD test showed a power loss of ~ 2% in the comparison of two normal distributions, but a gain of ~ 32% with mixture distributions with respect to the parametric tests. Accordingly, the AD test detected the greatest number of differentially expressed features in the real data application. Conclusion: The tests for the general two-sample problem introduce a more general concept of 'differential expression', thus overcoming the limitations of the other tests restricted to specific moments of the feature distributions. In particular, the AD test should be considered as a powerful alternative to the parametric tests for feature screening in order to keep as many discriminative features as possible for the class prediction analysis.


INTRODUCTION
Parametric and nonparametric two-sample tests are applied to a large number of high-dimensional continuous data for explorative studies in order to detect differentially expressed (DE) features (genes, miRNAs, metabolites) between different biomedical conditions, such as diseased (cases) and healthy subjects (controls).In particular, the tests are exploited as univariable feature ranking methods in class comparison [1], as well as a preliminary stepfeature screening -in class prediction [2].Such a screening is intended to identify promising features to be possibly included in a multivariable model, as the predictor or classifier, which aims at accurately predicting the class membership of a new sample based on a combination of expression levels of the selected features.
The most commonly used tests are the t test and the nonparametric WMW test, which refer to differences in terms of location and therefore are classified as location tests.However, feature distributions in the comparison classes may differ according to other aspects such as scale or, more generally, shape.One could test for location or scale changes (location-scale problem) or look for any changes in location, scale or shape (general two-sample problem) [3].Even small signals of general differences between the two classes could reveal discriminative features that should not be filtered out in the first phases of bioinformatics analyses, but further investigated in the following step of class prediction.Moreover, the asymptotic normality of the t-test statistic is often not fulfilled when dealing with some types of genomic data, mainly due to the small sample size [4], producing skewed, heavy-tailed or multimodal distributions of expression values.
In presence of such distributions, nonparametric alternatives to location tests, e.g. the Kolmogorov-Smirnov filter [5], could be more sensitive in feature screening, thus leading to a small number of false negative results.
In the field of high dimensional data, feature screening should not be tailored on specific distributional characteristics but rather be a flexible procedure, i.e. able to detect general differences between feature distributions under different patterns.Thus, a desirable test should prove to be robust in terms of Type I error control and powerful in a wide family of distributional patterns, even if not being the best one in every single situation.
The goal of this work was to compare via simulations different tests for class comparison of continuous data to draw suggestions for possible improvements with respect to the tests most commonly used in high-throughput data analysis.In particular, we conducted an extensive simulation study with sample sizes as small as those frequently encountered in the high dimensional data context [6] and including non normal distributions; a wide set of parametric and nonparametric tests for two class comparison was investigated according to size (i.e.type I error rate) and power, with the aim of possibly identifying a test to be used in the screening phase of high dimensional studies.Student's t and Welch's t tests [7] were used even if their assumptions are violated as they are the standard reference in practical applications.We investigated a series of nonparametric tests considering different alternatives versus the null hypothesis of equality between the Cumulative Distribution Functions (CDFs).Being aware that when the parametric tests assumptions are violated their power is deflated, our aim was to assess possible nonparametric alternatives and comparatively draw indications of possible improvements over the parametric tests.In particular, we implemented the following nonparametric tests: • the Wilcoxon-Mann-Whitney (WMW) test [8], detecting shifts in location between the CDFs; • two tests for the location-scale problem, i.e. the PG2 Podgor-Gastwirth (PG2) [9] and the Cucconi test [10][11]; the PG2 test has been recognised as the most powerful among the PG efficiency robust tests, while the Cucconi test represents the simplest and best performing alternative to PG2 [11]; • three chi-squared statistic-based tests, i.e. the Kolmogorov-Smirnov (KS) [12], the Cramer-von Mises (CvM) [13] and the Anderson-Darling (AD) [14] test; the KS test refers to the CDF maximum difference, the CvM test considers differences over the entire CDF range, while the AD test takes into account global CDF differences, granting more importance to the observations in the tails; the latter characteristic makes the AD test valuable when one is interested in finding also signals that are only present in a subset of patients; • the Zhang Z K , Z C and Z A tests, which are 'likelihood-ratio' based analogs of the 'traditional' KS, CvM and AD tests, respectively [3].
See the Appendix for further details on the considered tests.
As regards the simulation study, we chose to mimic the irregular pattern of the feature distributions of the two samples by sampling from mixtures of two normal distributions (NM), which should reproduce the coexisting presence of heterogeneous subpopulations underlying data, by varying the mixture parameters.We did not provide any adjustment for multiple testing.Indeed, the adjustment procedure after the tests itself is not the focus of the paper.

METHODS
A simulation study was conducted in order to compare the performance -in terms of size and power -of the tests described in the Appendix under different distributional patterns.
Following Burton et al. [15], data were simulated with resemblance to real continuous high dimensional data, replicating their irregular distributional patterns by sampling from mixtures of two normal distributions (NM) (Figure 1).
Let μ iA and μ iB be the means of the two components A and B of the mixture in the population i (i = 1,2), Parametric and nonparametric two-sample tests for feature screening in class comparison: a simulation study with μ iB -μ iA =sh i (shift), σ 2 iA and σ 2 iB be the component variances, μ i = λ i μ iA + (1λ i )μ iB the overall mixture mean, ] the overall mixture variance and λi the mixture weight, which is the probability associated with the first component of the mixture.
Finally, let δ = μ 2A -μ 1A be the difference between the first component means in the two populations.
Three main cases were simulated: A. two normal distributions; B. one normal and one mixture distribution; C. two mixture distributions; equal shifts sh 1 = sh 2 were also considered.The parameters δ, sh 1 , sh 2 were properly tuned over fixed ranges in order to simulate the different conditions under H 0 and H 1 .We considered four mixture weights λ Є {0.80; 0.95; 0.20; 0.05} and three small sample size settings, one balanced (m=20 vs n=20) and two unbalanced (m=20 vs n=40 and m=40 vs n=20) (Table 1).
We chose to consider a nominal significance level α = 0.05 and to perform B = 10000 simulations so as to obtain precise estimates derived via simulation.As an indicator of the simulation error we chose the standard error SE(p), with p indicating the nominal coverage probability [15].Finally, we calculated the relative frequencies of H 0 rejection of the tests; under H 0 such frequencies approximate the fixed nominal significance level α, while under H 1 they correspond to the empirical power of the tests.It was possible to simulate the null hypothesis patterns only when comparing two normal distributions with equal means (pattern I) or two perfectly overlapping mixture distributions, i.e. those with equal shifts (pattern IV).The robustness of the tests under H 0 was evaluated according to the indications given by both Conover [16] and Marozzi [17]: a test is considered robust if its Maximum Estimated Significance Level (MESL), i.e. its maximum relative frequency of H 0 rejection under H 0 (Table 2 -H 0 patterns) does not exceed a given threshold, typically 2α or 1.5α to be more restrictive.As regards the nonparametric tests, exact p-values have been computed for the KS test, while for the WMW, PG2, Cucconi, CvM and AD tests we report the asymptotic p-values.For the CvM test we used the p-values tabulated by Anderson et al. [13], since it has been shown that under H 0 the two-sample statistic has the same limiting distribution as that of the one-sample statistic [18].Moreover, using the Burr's formula, we can obtain the p-values corresponding to all the possible values of the CvM statistic and not limited to the tabulated ones; it is reported that such an approximation works to the fifth decimal place for values of the statistic between 0.42 and 2.2 and we empirically verified that, for values of the statistic greater than 0.10308, the formula approximates up to the second decimal place the p-values tabulated by Anderson et al. [13].Moreover, values of the statistic below 0.10308 correspond to p-values higher than 0.57, which are of no interest here since they indicate The rnormmix function from the mixtools package was used to simulate the mixtures of univariate normal distributions.Because of the computational burden of the simulation, parallel programming (using the two packages doParallel and doRNG) was implemented in order to perform simultaneous and reproducible computations.

The simulation study
A complete report of the simulation results is shown in the Supplementary material (Tables S2.1-S2.4).Given 10000 simulations, the chosen 5% significance level provided a SE equal to 0.2%, which is the simulation error for the H 0 patterns; as regards the power, in the worst situation when it is equal to 50%, the SE would be equal to 0.5%.Therefore, we got a precision of    (20,20).
As regards the power, we did not find a clear winner for all the patterns; in general, within the same category (location, or location-scale, or general two-sample problem) the tests shared similar power for all the considered patterns, except for KS and Z K which showed to be very conservative tests.Moreover, as expected, the tests for the general two-sample problem were generally more sensitive than those for the location-scale problem, especially when λ = 0.95.As an example to visualise the overall advantage brought by the general two-sample problem tests, we report the results in terms of power under the patterns I-VI, having fixed δ = 1 and m = 20 vs n = 20 and with λ = 0.80 (Figure 2) and λ = 0.95 (Table 3).
With both mixture weights, the two parametric location tests (Student's t and Welch's t) headed the power ranking in case of two normal distributions (pattern I) or when the second distribution was a mixture (pattern II); in the latter case their ability lied in detecting the observations in the tail of the mixture.However, when the two ECDFs were crossing, i.e. when the two distributions overlapped at certain points, their power collapsed (see Figures S1.1-S1.4 in the Supplementary material, patterns III, IV, VI).The location tests (Student's t, Welch's t and WMW) generally showed a high power for pattern V, where the two ECDFs of the mixtures appeared mostly separated, and thus the differences between the two samples were mainly in terms of location.The nonparametric location test, i.e. the WMW, was more powerful than the parametric tests in the patterns involving two mixture distributions, except for pattern V and λ = 0.80 (Tables S2.1.6,S2.1.7).However, it did not emerge as the best alternative to the parametric tests in presence of two mixture distributions, especially when λ = 0.80, where the advantage of the location-scale and general two-sample tests was more evident.
The location-scale tests (PG2 and Cucconi tests) showed the highest power in the patterns III and VI with λ = 0.80 (Tables S2.1.6,S2.1.8),corresponding to situations of scale differences being one distribution in the middle of the other one with ECDFs overlapping for the most part.Such tests seem to be particularly able to detect the differences in the peaks of the compared distributions.In general, the PG2 test was more powerful than the Cucconi test (both asymptotic and empirical versions) and the approximate empirical version of the Cucconi test was always more powerful than the asymptotic version.
In the patterns IV and V with λ = 0.80 and the patterns III-VI with λ = 0.95 the tests for the general two-sample problem were generally the most powerful ones.The most liberal tests were the AD test, its analog Z A test, the CvM test and its analog Z C test; in particular, for mixtures with equal shifts (pattern IV) the AD test was the most sensitive one, together with the Z C and CvM tests.Moreover, when the normality assumption was fulfilled, the AD, CvM, Z A and Z C tests had a limited power loss compared to that of the other nonparametric tests, while being very powerful in detecting any difference between the two samples in the remaining patterns.For example, the application of the AD test implied a loss in power of ~ 2% in pattern I, but a gain of ~ 32% in pattern IV with respect to the parametric tests (Table S2.1.6).It is worth to notice that, in spite of being tests for the general two-sample problem, the KS and its analog Z K proved to be very conservative, showing low power in all the simulated patterns.
With small weights to the first component of the mixtures (λ = 0.20 and λ = 0.05), we obtained similar results, being the AD, CvM, Z A and Z C the most sensitive tests in the III-VI patterns, while still maintaining a high power in the I and II patterns.Compared to λ = 0.80 and λ = 0.95, the power was higher for all the tests and, for patterns II, III and V, it often reached the 100%.Indeed, in these cases the mixture distribution density is concentrated at its second component, thus yielding well-separated distributions and easily detectable differences.
Regarding the unbalanced sample size settings (m=20 vs n=40 and m=40 vs n=20), the dominance of the general two-sample tests remained evident, except for the pattern V with m=20 vs n=40 and λ = 0.80, where the Welch's t test resulted as the most powerful test, even if the difference in power was small (~ 4%) with respect to the Z A test (Table S2.1.7).An explanation could be the presence of a large tail of the distribution of the second sample, corresponding to an evident difference between the two means (Figure S1.1, m=20 vs n=40, pattern V).

Van't Veer data analysis
We applied the considered tests to a real dataset included in the R Bioconductor package breastCancerNKI, containing gene expression data as published in Van't Veer et al. [19] and Van de Vijver et al. [20] (24481 genes/ features evaluated in 337 samples).We defined two classes according to the Estrogen Receptor (ER) status and matched 33 ER positive with 33 ER negative individuals.After a filtering at 100% level, 19264 genes remained.
We expected that the most conservative tests would detect less DE genes with respect to more liberal tests and, accordingly, the AD test identified the greatest number of DE features.
Details on the example at issue are reported in the Supplementary material, subparagraph S3.

CONCLUSION
Two-sample tests for class comparison are often used in bioinformatics and medicine for exploratory purposes, i.e. to detect DE features between different biomedical conditions.
The most commonly used are the location tests, such as the parametric t test and its variations, together with nonparametric tests such as the WMW test; however, they are able to detect only shifts in distributions and not to identify any other difference in scale or shape.These tests are often used in high dimensional studies for screening of features able to distinguish between two conditions.One drawback in applying the above tests is that the expression values may exhibit departures from normality and features could be DE in other aspects rather than location only.Indeed, they can miss features which are characterised by more general and subtle distributional discrepancies.These different signals might hide differential biological processes and have to be preserved in order to be further explored by including them in a multivariable model for class prediction.
We set a simulation study to evaluate the performance of different location tests (Student's t, Welch's t, WMW), tests for the location-scale problem (PG2 and Cucconi) and tests for the general two-sample problem (KS, CvM, AD, Z K , Z C , Z A ), by modelling the irregular signals by means of mixtures of two normal distributions.Although Z C in particular was suggested as the best one among the three Z K , Z C and Z A , we assessed the performance of all Zhang tests, since we wanted a complete comparison with the corresponding traditional tests (i.e.KS, CvM and AD).We did not find a clear winner for all considered distributional patterns among the tests proposed as an alternative to the most used ones.However, the simulation study and the application to Van't Veer's data showed that the tests for the general two-sample problem tend to save a greater number of DE features, with possible gain in power with respect to the location and location-scale tests.Location tests consider DE a feature with almost symmetric distributions in the two compared samples, while location-scale tests are able to detect also differences in terms of peak magnitude.The tests for the general two-sample problem go one step further introducing a more general concept of 'differential expression', thus overcoming the limitations of the above mentioned tests restricted to specific moments of the feature distributions.Specifically, the AD, CvM and their analogs Z A and Z C tests should be preferred since their power was very similar to that of the more efficient parametric tests when the normality assumption was fulfilled, while in all the other situations they still resulted very powerful in detecting differences between the two samples.The AD test in particular proved to be very sensitive in most of the simulated patterns; accordingly, by using the Van't Veer dataset, which represents a context similar to the simulated ones, the AD test resulted as the most 'saving-features' test to be further investigated.
In conclusion, the AD test should be considered as a powerful alternative to the parametric tests for feature screening in order to keep as many discriminative features as possible for the subsequent class prediction analysis.

The investigated two-sample tests
We can classify the investigated tests into three categories: I. location tests (Student's t, Welch's t, WMW); II.tests for the location-scale problem (PG2, Cucconi); III.tests for the general two-sample problem (KS, CvM, AD, Z K , Z C , Z A ).
All the tests were implemented using R software (http://www.r-project.org/).In the following, let x 1 ,... x m and y 1 ,...y n be the observations drawn from two independent random variables X and Y with continuous CDFs F and G, respectively.Also, let m + n = N.

Tests for the location problem
The location tests assess whether the centre of the data is the same for the two distributions.

Student's and Welch's t tests
Let μ ¹ and μ 2 be the means and σ 2 1 and σ 2 2 be the variances of the random variables X and Y.The null and alternative hypotheses of the considered parametric tests are: The t test is defined as ∼ and assumes that σ which is equivalent to: The null hypothesis is rejected for large values of CvM; asymptotic critical values are reported by Anderson and Darling [13].However, the distance between the two ECDFs tends to 0 when x → -∞ or x → + ∞ , thus the value of the CvM test statistic is rather insensitive to the differences in the distribution tails.We performed the test by implementing a user-defined function including the empirical correction formula reported by Burr [22], using the limiting distribution to approximate the exact distribution of the CvM test statistic (see the Supplementary material, subparagraph S4).

Anderson-Darling test
The AD test statistic is a modification of L 2 -CvM test statistic that, in order to give more weight to the observations in the distribution tails, includes a weighting function equal to the reciprocal of the variance of the ECDF (the latter is maximal around the median and minimal in the tails): simplification was introduced for computational purposes: where M i is defined as the number of observations in the first sample less than or equal to the i-th smallest in the pooled sample.The standardised statistic is obtained by using its exact mean (equal to 1 in case of two samples) and exact variance σ N , which was derived by Scholz [14]: The upper tail critical values for the aforementioned test statistic are reported by Scholz [14] and the null hypothesis is rejected for large values.The standardisation removes some of the dependence of the test on the sample size, as it was confirmed through a Monte Carlo study [14].For not tabulated critical values, an interpolation formula may be used to obtain the percentiles of interest.The test was performed using the adk.test function included in the adk package.

Zhang tests
All the considered Zhang tests (Z K , Z C , Z A ) derive from two types of test statistics [23] defined as: where Z t is the likelihood ratio test statistic and w(t) is a weighting function characterising the different tests.The Zhang tests are the analogs of the traditional tests KS, CvM and AD, which are obtained using the Pearson χ 2 test statistic as Z t .
-Zhang Z K test Z K is the analog of the KS test and it is obtained from Z max with w(t) = 1.The computational formula for the Z K test statistic is: where and H N denotes the ECDF of the pooled sample (k=1,...,N).Large values of the statistic guide to the rejection of H 0 .

TABLE 2 .V 6 VI 3 FIGURE 1 .
FIGURE 1. Example figure of two normal mixture distributions setting adopted into the simulation.μ iA and μ iB are the means of the two components A and B of the mixtures in the two samples i (i=1,2) with shifts shi = μ iB -μ iA (i=1,2), δ = μ 2A -μ 1A is the difference between the two mixture first component means and λ i are the two mixture weights of the A components, being their complement the mixture weights of the respective B components.
[13]nderson et al.[13], which involves the quadratic distance between the two ECDFs.Let H N (x) = mF m (x) + nG n (x)/N, being H N the ECDF associated with the combined sample.Then the CvM test statistic is defined as: ∼ Parametric and nonparametric two-sample tests for feature screening in class comparison: a simulation study test introduced