Continuity correction of Pearson's chi-square test in 2x2 contingency tables: A mini-review on recent developments

Pearson's chi-square test is one of the nonparametric tests most widely used in Biomedicine and the Social Sciences, but for 2x2 contingency tables it introduces an error, because a discrete probability distribution is approximated with a continuous distribution. Yates F. (1934) was the first author to introduce a continuity correction for Pearson's chi-square test. Unfortunately, Yates's correction tends to overcorrect the p-value, which can lead to an overly conservative result. Therefore, many authors have introduced variants of Pearson's chi-square statistic as alternative continuity corrections to Yates's correction. The goal of this paper is to describe the most recent continuity corrections proposed for Pearson's chi-square test.


INTRODUCTION
Pearson's chi-square test, or $\chi^2$ test, is a nonparametric test commonly used by researchers in Biology, Medicine and the Social Sciences. This test is based on the calculation of Pearson's $\chi^2$ statistic, introduced by Pearson K. [1], considering a sample of a population characterized by two or more dichotomous variables. For two dichotomous variables, it is possible to define a 2x2 contingency table containing the frequencies of occurrence of all combinations of their levels, for a sample of size N, as shown in Table 1. In a 2x2 contingency table, Pearson's $\chi^2$ statistic is used to test the association between the dichotomous variables, for example to identify a possible association between variables such as sex (Male/Female) and smoking (Yes/No). For this purpose Pearson introduced the chi-square statistic to evaluate the discrepancy between observed ($O_{i,j}$) and expected ($E_{i,j}$) frequencies, where the observed frequencies are a, b, c and d of Table 1. The expected frequencies are defined for every cell as:

$$E_{i,j} = \frac{r_i c_j}{N}$$

where i and j indicate the row and column index respectively. The formula to compute Pearson's $\chi^2$ statistic is described by Pearson K. (1900):

$$\chi^2 = \sum_{i,j} \frac{(O_{i,j} - E_{i,j})^2}{E_{i,j}} = \frac{N(ad - bc)^2}{r_1 r_2 c_1 c_2} \tag{1}$$

where $r_1$, $r_2$, $c_1$ and $c_2$, i.e. the totals across rows and columns, are generally called marginal totals.
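As an illustration, Pearson's statistic for a 2x2 table can be sketched in Python as follows. The function names and the example counts are hypothetical, and the cell layout (a and b in the first row, c and d in the second) is assumed to be the conventional one. The p-value uses the fact that, with 1 degree of freedom, the upper tail of the chi-square distribution equals erfc(sqrt(x/2)).

```python
import math


def pearson_chi2_2x2(a, b, c, d):
    """Pearson's chi-square statistic and p-value for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    # Marginal totals (assumed layout: a, b in row 1; c, d in row 2).
    r1, r2 = a + b, c + d
    c1, c2 = a + c, b + d
    if 0 in (r1, r2, c1, c2):
        raise ValueError("statistic is undefined when a marginal total is zero")
    # Shortcut form: N (ad - bc)^2 / (r1 r2 c1 c2).
    stat = n * (a * d - b * c) ** 2 / (r1 * r2 * c1 * c2)
    # Upper tail of a chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


def pearson_chi2_from_cells(a, b, c, d):
    """Same statistic, computed as the sum of (O - E)^2 / E over the four cells."""
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    observed = ((a, b), (c, d))
    total = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n  # E_ij = r_i * c_j / N
            total += (observed[i][j] - expected) ** 2 / expected
    return total


# Hypothetical counts, chosen only to show that both forms agree.
stat, p_value = pearson_chi2_2x2(10, 20, 30, 40)
print(round(stat, 4), round(pearson_chi2_from_cells(10, 20, 30, 40), 4))  # 0.7937 0.7937
```

Both forms give the same value because, in a 2x2 table, every cell has the same absolute deviation |O - E| = |ad - bc| / N.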
Using the $\chi^2$ distribution to interpret Pearson's $\chi^2$ statistic requires one to assume that the discrete probability distribution of the observed binomial frequencies of a 2x2 contingency table can be approximated by the continuous $\chi^2$ distribution. This assumption is not entirely correct and introduces some error. To reduce the approximation error, many authors have introduced continuity corrections or variants of Pearson's $\chi^2$ test.
To reduce the error introduced by Pearson's $\chi^2$ statistic, Yates F. [2] suggested a correction for continuity that adjusts the formula for Pearson's $\chi^2$ by subtracting 0.5 from the absolute difference between each observed value and its expected value in a 2x2 contingency table. This correction reduces the $\chi^2$ value obtained and consequently increases its p-value. The formula to compute Yates's $\chi^2$ statistic in a 2x2 contingency table is:

$$\chi^2_{Yates} = \sum_{i,j} \frac{(|O_{i,j} - E_{i,j}| - 0.5)^2}{E_{i,j}} = \frac{N(|ad - bc| - N/2)^2}{r_1 r_2 c_1 c_2} \tag{2}$$

Unfortunately, Yates's correction tends to overcorrect the p-value; this can lead to an overly conservative result, as reported by several authors [3][4][5][6][7].
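The effect of Yates's correction on the statistic and on the p-value can be illustrated with a minimal Python sketch. The function name, the `yates` flag and the example counts are hypothetical. Clamping the corrected difference at zero is a common safeguard added here as an assumption; it is not part of the description above.

```python
import math


def chi2_2x2(a, b, c, d, yates=False):
    """Chi-square statistic and p-value (1 df) for [[a, b], [c, d]], optionally Yates-corrected."""
    n = a + b + c + d
    r1, r2, c1, c2 = a + b, c + d, a + c, b + d
    if 0 in (r1, r2, c1, c2):
        raise ValueError("statistic is undefined when a marginal total is zero")
    diff = abs(a * d - b * c)
    if yates:
        # Subtracting 0.5 from every |O - E| is equivalent to subtracting N/2
        # from |ad - bc|. The clamp at zero is an added safeguard (assumption).
        diff = max(0.0, diff - n / 2.0)
    stat = n * diff ** 2 / (r1 * r2 * c1 * c2)
    p_value = math.erfc(math.sqrt(stat / 2.0))  # upper tail, 1 degree of freedom
    return stat, p_value


# Hypothetical counts: the corrected statistic is smaller, so its p-value is larger.
plain_stat, plain_p = chi2_2x2(10, 20, 30, 40)
yates_stat, yates_p = chi2_2x2(10, 20, 30, 40, yates=True)
print(round(plain_stat, 4), round(yates_stat, 4))  # 0.7937 0.4464
```

For these counts the statistic drops from about 0.794 to about 0.446, which shows the direction of the correction: a smaller statistic and therefore a larger p-value.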
The goal of this study is to describe, through a literature review, the most recent developments in continuity corrections based on variants of Pearson's $\chi^2$ test defined for 2x2 contingency tables.

METHODS
In this section we introduce the most recent studies on the continuity correction of Pearson's $\chi^2$ statistic in 2x2 contingency tables.

Serra's continuity correction
Recently Serra N. [8] introduced a significant minimization of Pearson's $\chi^2$ statistic as a continuity correction of Pearson's $\chi^2$ test for small samples (sample size ≤ 25). This approach is based on the observation that the denominator $r_1 r_2 c_1 c_2$ of (1) can be interpreted as a geometric mean. The minimized Pearson's $\chi^2$ statistic in a 2x2 contingency table is computed with equation (3). Serra N. showed, with a statistical approach, that for small samples (≤ 25) the minimized Pearson's $\chi^2$ statistic in 2x2 contingency tables represents a continuity correction for Pearson's $\chi^2$ statistic that is more effective than Yates's continuity correction. In particular, in this study the author verified that Fisher's exact test [9,10], currently considered the "gold standard" test when the $\chi^2$ test is not appropriate, i.e. when the sample size is small and the expected values in any of the cells of a 2x2 contingency table are below 5, had a performance statistically equal to that of the $\chi^2$ Serra test.
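For reference, Fisher's exact test and the "expected values below 5" condition mentioned above can be sketched in Python as follows. This is a sketch under stated assumptions, not the procedure used in [8]: the function names and the example table are hypothetical, and the two-sided p-value follows one common convention (summing the probabilities of all tables, with the same margins, that are no more likely than the observed one).

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]]."""
    r1, r2 = a + b, c + d
    c1 = a + c
    n = r1 + r2
    total = comb(n, c1)

    def prob(k):
        # Hypergeometric probability of k in the top-left cell, margins fixed.
        return comb(r1, k) * comb(r2, c1 - k) / total

    p_observed = prob(a)
    k_min, k_max = max(0, c1 - r2), min(r1, c1)
    # Small relative tolerance so that ties with the observed table are included.
    p_value = sum(
        prob(k) for k in range(k_min, k_max + 1)
        if prob(k) <= p_observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


def min_expected_count(a, b, c, d):
    """Smallest expected frequency r_i * c_j / N over the four cells."""
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    return min(r * c / n for r in rows for c in cols)


# Hypothetical small table: every expected count is 2, i.e. below 5.
p = fisher_exact_two_sided(3, 1, 1, 3)
print(min_expected_count(3, 1, 1, 3), round(p, 4))  # 2.0 0.4857
```

For this table the possible values of the top-left cell have probabilities 1/70, 16/70, 36/70, 16/70 and 1/70, so the two-sided p-value is 34/70.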

Kajita Matchita et al.'s continuity correction
Kajita Matchita et al. [11] proposed a continuity correction to be used when small expected cell frequencies occur in the research data for Pearson's $\chi^2$ test of independence. This correction method is designed to control the type I error, and it relies on a developed correction value obtained under several conditions. For this purpose the authors used a simulation study. The simulations were performed with the Monte Carlo method, to evaluate the performance of their method in comparison with other continuity corrections such as Yates's correction and Williams's correction [12]. Their method showed better control of the type I error, considering data patterns defined by: significance levels of 0.05 and 0.01; simulated contingency tables between 2x2 and 4x4 (2x2, 2x3, 2x4, 3x3, 3x4 and 4x4); a number of small expected cell frequencies up to 30% of the total number of cells; a sample size between 5 and 10 times the total number of cells; and 10,000 data sets simulated by the Monte Carlo method for each pattern. The type I error (number of rejections of the null hypothesis divided by 10,000) was evaluated by Pearson's $\chi^2$ test, i.e. by the classical $\chi^2$ test without continuity correction.
In the case of 2x2 contingency tables, where the type I error is greater than the significance level, the $\chi^2$ test equation to be used is equation (4); where the type I error is less than the significance level, the $\chi^2$ test equation is equation (5). In both equations $O_{i,j}$ and $E_{i,j}$ represent the observed and expected frequencies respectively, and C is the developed correction value. It was computed in two cases as follows. If the type I error is higher than the significance level, the authors substitute values of C into equation (4), starting from 0.01, 0.02, 0.03, .... If the type I error is less than the significance level, they substitute values of C into equation (5), starting from 0.01, 0.02, 0.03, .... After each substitution of C, the type I error is computed and compared with the significance level. The developed correction value (C) is the value that brings the type I error closest to the significance level.
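The type I error estimate that drives this procedure (number of rejections divided by the number of simulated data sets) can be sketched for the uncorrected Pearson test on 2x2 tables as follows. This is a minimal sketch, not the simulation design of [11]: the function names, the marginal probabilities, the seed and the treatment of tables with a zero marginal total (counted as non-rejections) are all assumptions made for illustration.

```python
import math
import random


def pearson_p_value(a, b, c, d):
    """P-value of the uncorrected Pearson test (1 df), or None if undefined."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return None  # a marginal total is zero
    stat = n * (a * d - b * c) ** 2 / denom
    return math.erfc(math.sqrt(stat / 2.0))


def estimate_type_one_error(n, p_row, p_col, alpha=0.05, n_sims=10_000, seed=0):
    """Share of simulated independent 2x2 tables for which the null is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        cells = [0, 0, 0, 0]  # a, b, c, d
        for _ in range(n):
            # Row and column are drawn independently, so the null hypothesis holds.
            i = 0 if rng.random() < p_row else 1
            j = 0 if rng.random() < p_col else 1
            cells[2 * i + j] += 1
        p = pearson_p_value(*cells)
        # Assumption: tables with an undefined statistic count as non-rejections.
        if p is not None and p < alpha:
            rejections += 1
    return rejections / n_sims


# Sample size 20 is 5 times the 4 cells of a 2x2 table; fewer simulations keep it fast.
rate = estimate_type_one_error(n=20, p_row=0.5, p_col=0.5, n_sims=2000, seed=1)
print(rate)
```

Comparing the returned rate with the significance level indicates which of the two cases described above applies for a given pattern.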

CONCLUSION
In this paper we described the most recent studies on the continuity correction of Pearson's $\chi^2$ test. Since the first continuity correction, proposed by Yates (1934), produces an overcorrection of the p-value, many authors discourage its use. Other authors [13][14][15][16][17][18] have instead followed Yates (1934) in claiming that the use of Pearson's $\chi^2$ in the case of 2x2 contingency tables tends to generate too many type I errors, especially with small samples; therefore they defined different continuity corrections of Pearson's $\chi^2$ statistic, to reduce the type I error and simultaneously reduce the type II error that Yates's correction introduces. Unfortunately, the study of the continuity correction of Pearson's $\chi^2$ statistic is very limited in the recent statistical literature; only two recent studies are dedicated to this problem (Serra N

Funding statement
This research did not receive any specific grant from funding agencies in the public, commercial, or not for profit sectors.