Should methods of correction for multiple comparisons be applied in pharmacovigilance? Reasoning around an investigation on safety of oral antidiabetic drugs

BACKGROUND: In pharmacovigilance, spontaneous reporting databases are devoted to the early detection of adverse event 'signals' of marketed drugs. A common limitation of these systems is the large number of concurrently investigated associations, implying a high probability of generating positive signals simply by chance. However, it is not clear whether methods that adjust for multiple testing are needed when at least some of the drug-outcome relationships under study are known. To address this question, we applied a robust estimation method for the false discovery rate (rFDR) that is particularly suitable in the pharmacovigilance context.
METHODS: We exploited the data available for the SAFEGUARD project to apply the rFDR estimation method to detect potential false positive signals of adverse reactions attributable to the use of non-insulin blood glucose lowering drugs. Specifically, the number of signals generated from the conventional disproportionality measures was compared with the number remaining after the application of the rFDR adjustment method.
RESULTS: Among the 311 evaluable pairs (i.e., drug-event pairs with at least one adverse event report), 106 (34%) signals were considered significant in the conventional analysis. Only one of them was classified as a false positive signal by the rFDR method.
CONCLUSION: The results of this study seem to suggest that when a restricted number of drug-outcome pairs is considered and warnings about some of them are known, multiple comparisons methods for recognizing false positive signals are not as useful as suggested by theoretical considerations.


INTRODUCTION
Spontaneous reporting (SR) databases are useful tools to generate signals, i.e. abnormal or unusual reporting patterns suggestive of increased health risks associated with the use of a given drug [1][2][3]. Although they provide answers in a timely and cost-effective fashion, a large number of possible associations are concurrently investigated by such an approach. This implies a high probability of generating positive signals (i.e. statistically significant drug-outcome associations) simply by chance. False positive signals make interpretation of the entire panel of results difficult. It would therefore be helpful to minimize this source of error in order to clarify the focus for further research [4,5].
Different approaches addressing massive hypothesis testing have been developed. A conservative approach is to control the Family Wise Error Rate (FWER), that is, the probability of rejecting at least one true null hypothesis among all those tested; the Bonferroni method is one of the most widely used to control this error [6]. A less conservative approach is to control the False Discovery Rate (FDR), i.e., the expected proportion of false positive findings among all the rejected hypotheses [7]. It should be considered, however, that a major assumption of the FDR is that the p-values are uniformly distributed under the null hypothesis. Pharmacovigilance generally aims to detect signals, thus one-sided hypothesis tests are of interest. However, when one-sided hypothesis tests are performed, the uniformity assumption of the p-values is systematically violated, making the classical FDR approach inapplicable. Recently, Pounds and Cheng proposed a robust method for the estimation of the FDR (rFDR) that overcomes this assumption [8].
We exploited the data available for the Safety Evaluation of Adverse Reactions in Diabetes (SAFEGUARD) EU project, an international consortium aimed at assessing the safety of non-insulin blood glucose lowering (NIBGL) drugs, to evaluate the need to apply multiple testing correction, through the rFDR, in pharmacovigilance when a restricted number of hypotheses is tested [9][10][11][12][13].
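The contrast between FWER and FDR control can be illustrated with a minimal Python sketch. It is not part of the original analysis: the function names are hypothetical, the FWER formula assumes independent tests, and the Benjamini-Hochberg step-up rule shown here is the classical FDR procedure, not the rFDR method.

```python
def fwer_independent(m, alpha=0.05):
    """Probability of at least one false positive among m independent
    tests of true null hypotheses, each run at level alpha."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_reject(pvals, alpha=0.05):
    """Bonferroni: reject H_i when p_i <= alpha / m (controls the FWER)."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]


def benjamini_hochberg_reject(pvals, alpha=0.05):
    """Classical Benjamini-Hochberg step-up rule (controls the FDR):
    find the largest rank k with p_(k) <= k * alpha / m and reject the
    k hypotheses with the smallest p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            k = rank
    reject = [False] * m
    for i in order[:k]:
        reject[i] = True
    return reject


if __name__ == "__main__":
    # Illustrative p-values only, not data from the study.
    pvals = [0.001, 0.008, 0.039, 0.041, 0.042,
             0.060, 0.074, 0.205, 0.212, 0.216]
    print(fwer_independent(261))                  # chance of >= 1 false signal
    print(sum(bonferroni_reject(pvals)))          # rejections under Bonferroni
    print(sum(benjamini_hochberg_reject(pvals)))  # rejections under BH
```

On the illustrative p-values, the Bonferroni rule rejects fewer hypotheses than the Benjamini-Hochberg rule, which is the sense in which FWER control is the more conservative of the two.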

Data sources
We used the data retrieved from two SR databases, namely FDA-AERS and EudraVigilance. The FDA-AERS database was set up in 2004 in the United States and receives adverse drug reaction reports from healthcare professionals, patients and drug manufacturers worldwide. A public, anonymized version of the FDA-AERS database is readily accessible by downloading data files from the FDA website.

CONFLICT OF INTEREST
The authors declare no conflict of interest.
The results of this study were never presented before.
... since this time-window was available for both databases.

Drug assessment
In the public version of the FDA-AERS database the coding of drug names is highly variable and only partially standardized to the FDA's drug dictionary. This limitation makes the identification of all the NIBGL drugs difficult. Thus, the strategy adopted for the current study was to identify as many NIBGL drugs as possible by mapping reported drug names to a reference list of generic and trade names for the NIBGL drugs. This reference list was compiled manually using the on-line version of Martindale: The Complete Drug Reference (www.medicinescomplete.com; accessed 30/11/12). All other drugs were regarded as non-NIBGL agents. NIBGL drugs identified by this process were recoded with their generic name and subsequently standardized using the ATC classification for the purposes of analysis. In the EudraVigilance dataset, drug names were coded to generic drug name at source.

Outcome assessment
The outcomes of interest for the current study were the following: ventricular arrhythmia, heart failure, myocardial infarction, haemorrhagic stroke, ischemic stroke, sudden cardiac death, acute pancreatitis, pancreatic cancer and bladder cancer.
The MedDRA terminology dictionary was used to code reactions in both the FDA-AERS and EudraVigilance databases [http://www.meddra.org/]. In order to extract cases from these databases, each outcome of interest was defined by two pre-specified lists of preferred terms.

Raw signal generation
Following Evans et al., signal generation was based on the disproportionality approach [14]. The disproportionality measure used for signal generation in this study was the proportional reporting ratio (PRR). The PRR is the ratio between the proportion of reports of the outcome of interest among all reports for the drug considered and the proportion of reports of the outcome of interest among all reports for all other drugs. To evaluate whether a drug was significantly associated with a specific outcome, a z-test based on a large-sample normal approximation was performed on the logarithmic transformation of the PRR. A signal was generated whenever the null hypothesis of proportionality for the natural logarithm of the PRR (i.e., $H_0: \ln(\mathrm{PRR}) \le 0$) was rejected (p-value ≤ 0.05) in favour of the one-sided alternative hypothesis of disproportionality (i.e., $H_1: \ln(\mathrm{PRR}) > 0$).
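The raw signal rule can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code used in the study: the function name and the cell labels a, b, c, d are hypothetical, and the standard error is the usual large-sample (delta-method) formula for the logarithm of a ratio of proportions, which the text does not spell out.

```python
import math


def prr_signal(a, b, c, d, alpha=0.05):
    """One-sided z-test of H0: ln(PRR) <= 0 on a 2 x 2 table.

    a: reports of the outcome of interest for the drug
    b: reports of all other outcomes for the drug
    c: reports of the outcome of interest for all other drugs
    d: reports of all other outcomes for all other drugs
    Requires a > 0 and c > 0.
    """
    prr = (a / (a + b)) / (c / (c + d))
    # ASSUMPTION: delta-method standard error of ln(PRR).
    se = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    z = math.log(prr) / se
    # Upper-tail probability of the standard normal, P(Z >= z).
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return prr, z, p_value, p_value <= alpha


if __name__ == "__main__":
    # Illustrative counts only, not data from the study.
    prr, z, p_value, is_signal = prr_signal(a=20, b=80, c=100, d=900)
    print(prr, z, p_value, is_signal)
```

Because the test is one-sided, a PRR of exactly 1 yields a p-value of 0.5, and values below 1 yield p-values above 0.5; this is one way to see why the p-values of such tests need not be uniformly distributed across a mix of protective and null associations.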

Multiple testing
Consider the situation of simultaneously testing $m$ (null) hypotheses, of which $m_0$ are true (Table 1). $R$, the number of rejected hypotheses, is observable, whereas $U$, $V$, $S$ and $T$ are unobservable random variables: $U$ and $S$ are the numbers of correctly classified hypotheses, and $T$ and $V$ the numbers of erroneously classified hypotheses.
Benjamini & Hochberg originally defined the FDR as the expected proportion of false positive findings among all rejected hypotheses, given that at least one null hypothesis is rejected, multiplied by the probability of making at least one rejection: $\mathrm{FDR} = E(V/R \mid R > 0)\,\Pr(R > 0)$ [7]. The estimation of the q-values, the natural FDR analogues of p-values, corresponding to a given set of raw ordered p-values $p_{(1)} \le p_{(2)} \le \dots \le p_{(m)}$ is based on the local FDRs. These are defined as $\mathrm{lFDR}_i = \tilde{v}(p_{(i)}) / F(p_{(i)})$ for $i = 1, \dots, m$, where $\tilde{v}(p_{(i)})$ is the estimated expected proportion of false positives when $p_{(i)}$ is used as the threshold for evaluating the significance of each test, and $F(p_{(i)})$ is the proportion of p-values less than or equal to $p_{(i)}$, i.e. $\Pr(p \le p_{(i)})$.
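Pounds & Cheng proposed a robust estimation procedure for the FDR (rFDR) [8], where $\tilde{v}(p_{(i)})$ is estimated using $\pi$, the cross-validation estimate of the proportion of true null hypotheses based on the distribution of observed raw p-values (see [8] for the explicit expression). This estimator is modified for $p_{(i)} > 1/2$ to avoid producing exceedingly large $\mathrm{lFDR}_i$ values for large $p_{(i)}$ if the observed raw p-values follow a U-shaped distribution (as might typically happen in real applied settings) [8]. For each raw p-value $p_{(i)}$, the corresponding q-value is $q_i = \min_{j \ge i} \mathrm{lFDR}_j$. For example, if $q_i$ is less than 0.05, then all hypotheses associated with the p-values from $p_{(1)}$ to $p_{(i)}$ can be rejected while ensuring that the FDR does not exceed 0.05.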
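The step from local FDRs to q-values can be sketched in Python as below. This is a simplified illustration, not the rFDR estimator of [8]: the function name is hypothetical, the proportion of true null hypotheses `pi0` is supplied by the user rather than estimated from the data, and the expected proportion of false positives at threshold p is assumed to be `pi0 * p`, without the modification for p-values above 1/2. With `pi0 = 1` the result coincides with the classical Benjamini-Hochberg adjusted p-values.

```python
def q_values_from_pvalues(pvals, pi0=1.0):
    """q_i = min over j >= i of lFDR_j, with lFDR_j = v(p_(j)) / F(p_(j)).

    ASSUMPTION: v(p) = pi0 * p, with pi0 given by the caller.
    F is the empirical proportion of p-values <= p_(j); it is computed
    as rank / m, and the running minimum below takes care of ties.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])

    # Local FDR at each ordered p-value, capped at 1.
    lfdr = []
    for rank, i in enumerate(order, start=1):
        lfdr.append(min(1.0, pi0 * pvals[i] / (rank / m)))

    # Running minimum from the largest p-value downwards.
    for k in range(m - 2, -1, -1):
        lfdr[k] = min(lfdr[k], lfdr[k + 1])

    # Return the q-values in the original order of the input.
    q = [0.0] * m
    for k, i in enumerate(order):
        q[i] = lfdr[k]
    return q


if __name__ == "__main__":
    # Illustrative p-values only, not data from the study.
    pvals = [0.001, 0.008, 0.039, 0.041, 0.042,
             0.060, 0.074, 0.205, 0.212, 0.216]
    q = q_values_from_pvalues(pvals)
    print([round(x, 4) for x in q])
    print(sum(x <= 0.05 for x in q))  # hypotheses rejected at FDR 0.05
```

The running minimum is what makes the q-values non-decreasing in the ordered p-values, so that rejecting every hypothesis with a q-value below the chosen level is a coherent rule.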

Concordance between raw and rFDR methods
Two I x J matrices crossing the I NIBGL drugs and the J adverse reactions of interest (identified using the narrow definition) were separately built from each database. After collapsing these data into as many 2 x 2 contingency tables as there were NIBGL drug-outcome pairs with at least one report, PRR point estimates and the corresponding raw p-values, as well as the rFDR q-values, were calculated.
Concordant and discordant pairs comparing raw and rFDR signals were counted. A drug-outcome pair was considered concordant if the raw and rFDR methods gave the same classification in terms of significance of a given signal (i.e., the p- and q-value were either both ≤ 0.05 or both > 0.05), and discordant otherwise.
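The concordance count described above amounts to cross-classifying each drug-outcome pair by its raw p-value and its q-value. A minimal Python sketch follows; the function name and the category labels are hypothetical and chosen only for illustration.

```python
def concordance_counts(raw_p, rfdr_q, alpha=0.05):
    """Cross-classify drug-outcome pairs by raw and rFDR significance.

    raw_p and rfdr_q are equally long sequences holding, for each
    evaluable pair, the raw p-value and the corresponding q-value.
    """
    counts = {"both_signal": 0, "raw_only": 0, "rfdr_only": 0, "neither": 0}
    for p, q in zip(raw_p, rfdr_q):
        raw_sig = p <= alpha
        rfdr_sig = q <= alpha
        if raw_sig and rfdr_sig:
            counts["both_signal"] += 1
        elif raw_sig:
            counts["raw_only"] += 1   # raw signal not confirmed by rFDR
        elif rfdr_sig:
            counts["rfdr_only"] += 1
        else:
            counts["neither"] += 1
    counts["concordant"] = counts["both_signal"] + counts["neither"]
    counts["discordant"] = counts["raw_only"] + counts["rfdr_only"]
    return counts


if __name__ == "__main__":
    # Illustrative values only, not data from the study.
    print(concordance_counts([0.01, 0.04, 0.20], [0.03, 0.08, 0.30]))
```

In this layout, the pairs counted under `raw_only` are the raw signals that the adjustment reclassifies as potential false positives.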

RESULTS
In each database, 261 potential drug-outcome pairs (i.e., 29 drugs x 9 reactions) were considered. Table 2 reports the number of concordant and discordant drug-outcome pairs identified by the raw and rFDR methods within the FDA-AERS and EudraVigilance databases, respectively. From the FDA-AERS database, 140 (54%) nonempty cells were obtained, of which 55 (39%) were raw signals; perfect agreement was observed between raw and rFDR. From the EudraVigilance database, 171 (66%) nonempty cells were observed, of which 51 (30%) were raw signals. Only one raw signal, i.e., the effect of gliquidone on acute pancreatitis, was classified as a false positive by the rFDR. A total of 71 individual concordant drug-outcome signals were confirmed by the rFDR method considering both databases (data not shown).

DISCUSSION
The current study investigated the possible association between 29 antidiabetic agents and 9 outcomes, that is, 261 potential drug-outcome pairs. The open question is whether all the signals generated are due to a drug effect or whether some of them (how many? which ones?) have been generated by chance. The approach followed in our paper was to control for false positive signals through a robust estimation method of the false discovery rate.
FDR-type methods have been described as particularly suitable for screening purposes, as is the case for signal generation in the setting of pharmacovigilance [15]. However, theoretical considerations and simulation studies have shown that the classical FDR approach may be too conservative [16]. This occurs mainly when the assumption of uniformity of the p-values under the null hypothesis required by the classical FDR method is violated, as in the case of one-sided hypotheses. The rFDR estimation method was developed to relax this assumption. FDR-type methods, including their robust version, are typically applied in genetics, a setting where thousands of tests are simultaneously performed in almost total absence of a priori knowledge. In the pharmacovigilance setting, by contrast, and particularly in the application presented in the current study, a more restricted number of tests (drug-outcome pairs) is of interest. Our findings showed that among the 96 signals generated from the conventional (raw) approach, almost all (95) were confirmed after the application of the rFDR method. The only signal detected as a false positive by the rFDR concerns the effect of gliquidone on acute pancreatitis. This evidence was confirmed by another study based on the FDA-AERS database [17]. However, to the best of our knowledge, no other studies have been published on the effect of gliquidone on the risk of acute pancreatitis.
The empirical evidence that the rFDR detected a negligible proportion of false positive signals in our application may have several explanations. First, the number of hypotheses simultaneously tested is restricted in our setting, so it is possible that the probability of generating a false positive signal is not inflated as much as it would be if the number of tests were much larger.
Secondly, since some of the drug-outcome associations of interest are already known, the number of related reports is expected to be high. For example, the relationship between rosiglitazone and cardiovascular and cerebrovascular outcomes is well known and largely documented in the scientific literature [18][19][20][21][22]. The numbers of reports regarding the relationship between rosiglitazone and three cardio-cerebro-vascular outcomes, namely myocardial infarction, heart failure and ischemic stroke, were respectively 15,040, 10,018 and 3,004 in the FDA-AERS database and 8,086, 7,270 and 5,967 in the EudraVigilance database. Similar results were found for the pancreatic safety of exenatide. Some evidence suggests a possible role of incretin mimetic drugs in the onset of pancreatic outcomes, even if this topic is still debated [23][24]. The numbers of reports regarding the relationship between exenatide and acute pancreatitis and pancreatic cancer were 2,235 and 222 in the FDA-AERS database and 1,742 and 221 in the EudraVigilance database. All this evidence should lead to highly significant signals that are unlikely to be detected as false positives by the rFDR approach. Thus, the number of potential false positive signals is limited by design in this setting.
It should be noted that the analysis of spontaneous report databases is subject to several types of bias related to the spontaneous character of the reports. In particular, it is well known that the information reported in these databases is uncontrolled and thus may be affected by a number of reporting-related biases. These biases include the length of time a product has been on the market, country, reporting environment, detailing time and quality of the data [25]. Additionally, reported cases may differ from unreported ones in terms of disease severity or other clinical characteristics. Moreover, the ability to assess, analyse and act on safety issues based on spontaneous reporting depends on the quality of the reports [25]. Finally, the disproportionality measures calculated using these data may be affected by confounding and cannot take into account differences in patients' clinical profiles and the presence of co-medications. Given these limitations, it is possible that some of the signals were generated erroneously as a result of the combination of different uncontrolled causes. This may explain, for example, the signals associated with metformin use. The use of this drug was associated with myocardial infarction and heart failure in both databases, although it is well known that metformin has an acceptable cardiovascular safety profile [26]. Such false positive signals might be due to a systematic error, such as confounding, and cannot be discarded using methods, like the rFDR, that act on random error.

CONCLUSIONS
These considerations, taken together, seem to suggest that when a restricted number of drug-outcome pairs is considered and warnings about some of them are known, multiple comparisons methods for recognizing false positive signals are not as useful as suggested by theoretical considerations.
Epidemiology Biostatistics and Public Health - 2015, Volume 12, Number 4