Discovering potential blood-based cytokine biomarkers for Alzheimer’s disease using Firth Logistic Regression

Background: Alzheimer’s disease (AD) is a neurodegenerative disorder in which patients suffer from memory loss, cognitive impairment and progressive disability. Individual blood biomarkers have not been successful in defining the disease pathology, progression and diagnosis of AD. There is a need to identify multiplex panels of blood biomarkers for early diagnosis of AD with high sensitivity and specificity. This study focused on the identification of cytokine biomarkers. The maximum likelihood estimates of the ordinary logistic regression model cannot be obtained when there is complete separation; an alternative is Firth logistic regression, which uses penalised maximum likelihood for parameter estimation. Methods: This paper reports an application of Firth logistic regression to finding potential blood-based cytokine biomarkers for Alzheimer’s disease in a matched case-control study. We used principal component analysis to group the correlated, completely separated covariates. Results: The Firth logistic regression results showed that nine individual biomarkers IL-1


INTRODUCTION
Alzheimer's disease (AD) is a neurodegenerative disorder characterised by the gradual progression of memory loss, impairment of cognitive functions and progressive disability, and it accounts for 60% to 80% of all types of dementia [1][2][3]. The disease is also commonly characterised by the development of amyloid-beta (Aβ) plaques and hyper-phosphorylated tau neurofibrillary tangles that lead to neuronal death or apoptosis and memory decline [4]. AD can incur tremendous social and economic costs for the sufferer as well as for caregivers and family members. Genes associated with the development of AD might have an effect on chemical neurotransmitters, which allow messages to be communicated between nerve cells in the brain [5].
The prevalence of AD was estimated at 47 million cases worldwide according to the World Alzheimer Report 2016 [6]. The number is expected to increase to 131.5 million by 2050, affecting mostly low- and middle-income countries [7]. In Malaysia, AD cases are believed to be under-reported because most family members view the symptoms as normal ageing and hence do not seek medical treatment.
Cytokines are soluble proteins or glycoproteins produced by leukocytes. They act as chemical communicators between cells in a way similar to hormones, but with the strongest activity in the microenvironment of the cells in which they are found [8,9]. Cytokines are involved both in healthy biological processes, such as cell growth, differentiation, inflammation, immunity, repair and fibrosis, and in pathological processes [10]. The word cytokine comes from Greek, where cyto means cell and kinos means movement. These words reflect the role of cytokines in the cellular response to infection [9].
Individual cytokine-based biomarkers of AD have been established in several studies. Mire-Sluis [11] reviewed potential serum and cerebrospinal fluid (CSF) biomarkers that contribute to the pathology of several neurodegenerative diseases, including AD, by comparing more than 50 cytokines from over 100 publications. The potential blood-based biomarkers identified for AD are FGF1, IL-11, IL-18, ACT, IL-1β, GMCSF, HGF, IFN-γ, IL-1RA, IL-2, IL-2R, IL-10, IL-12, MIP-1α, SDF-1α, sTNF-R1 and sTNF-R2. AD-related down-regulated cytokines, namely IL-6R, IL-6, TNF-α and transforming growth factor beta (TGF-β), were also discovered. Furthermore, interleukin 1 beta (IL-1β) was reported to be an important contributor to AD, acting as a master regulator of neuroinflammation produced by active inflammatory cells of myeloid lineage in microglia. In addition, serum amyloid A (SAA) was detected together with IL-1β-immunopositive microglia in AD patients. Thus a link was proposed between P2X7R, SAA and IL-1β in the central nervous system (CNS) pathophysiology [12].
Several cytokines such as INSR, VEGF-A, PRKACB, DLG4 and BCL2 are presumed to be involved in manganese-induced AD [13], while IL-1β and TGFβ can increase the amount of released VEGF (A to E) in certain cells [14]. In addition, according to [14], IL-1β and TGFβ act as mediators in paracrine VEGF-A production. Dayana et al. [15] investigated 12 cytokines and reported that CXCL-10 and IL-13 were promising cytokine biomarkers for AD. CXCL-10 was significantly negatively correlated with Mini-Mental State Examination (MMSE) scores, while IL-13 had a significant positive correlation with MMSE scores in AD patients.
In statistical modelling, logistic regression is commonly used for modelling a binary dependent variable. However, when the dataset is small, maximum likelihood estimation (MLE) for logistic regression faces several problems, such as biased or infinite estimates of the regression coefficients and frequent convergence failure of the likelihood due to separation [16][17][18]. When the data pattern shows complete or quasi-complete separation, the MLE does not exist (this phenomenon is called monotone likelihood) [19]. Figure 1 illustrates complete separation for a continuous and a categorical variable: Figure 1(a) shows an example of separation due to a continuous variable, while Figure 1(b) shows separation due to a dichotomous covariate.
When data separation occurs, three alternatives are frequently employed: (1) increasing the sample size, (2) combining categories and (3) omitting the category (for more than 2 categories). However, increasing the sample size in a clinical study is often not financially or administratively practicable, while combining categories is not always possible, especially when there are only two mutually exclusive categories. Omitting a category might be dangerous if the category is important in the study. Firth's penalised MLE method and the exact logistic regression method can address the separation issue. The exact method is computationally demanding and becomes infeasible when the sample size is greater than 100 [20,21]. The exact method permits replacement of the unsuitable maximum likelihood estimate by a median unbiased estimate [19]. Firth logistic regression (penalised MLE for logistic regression) works well with multiple predictors and large sample sizes, and it produces nearly unbiased estimates of the coefficients [19,22,23]. Rainey [24] suggested choosing among a range of priors in logistic regression, such as an informative normal prior, a Cauchy (0, 2.5) prior, Jeffreys' invariant prior, a skeptical normal (0, 2) prior and an enthusiastic normal (0, 8) prior. A detailed explanation of Firth's penalised maximum likelihood is presented in the Supplementary Material.
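For orientation, Firth's method is usually written as the ordinary log-likelihood plus a penalty based on Jeffreys' invariant prior. The block below gives that standard textbook form for logistic regression; the symbols (β for the coefficient vector, π_i for the fitted probability of subject i, x_ir for covariate r of subject i, h_i for the leverage) are notation assumed here for illustration and are not taken from the Supplementary Material.

```latex
% Penalised log-likelihood and modified score equations (standard form)
\ell^{*}(\beta) = \ell(\beta) + \tfrac{1}{2}\log\bigl|I(\beta)\bigr|,
\qquad
I(\beta) = X^{\top} W X,
\quad
W = \operatorname{diag}\{\pi_i(1-\pi_i)\}

U^{*}(\beta_r) = \sum_{i=1}^{n}
  \Bigl\{\, y_i - \pi_i + h_i\bigl(\tfrac{1}{2} - \pi_i\bigr) \Bigr\}\, x_{ir} = 0,
\qquad
h_i = \bigl[\, W^{1/2} X (X^{\top} W X)^{-1} X^{\top} W^{1/2} \,\bigr]_{ii}
```

The extra term h_i(1/2 − π_i) pulls each fitted probability towards 1/2, which is why the penalised estimates stay finite even when the data are completely separated.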
The aim of this study is to illustrate the use of Firth logistic regression in discovering potential cytokine biomarkers for AD. This paper is organised as follows: the next section describes the data and Firth's penalised maximum likelihood, the results and discussions are presented in Section 3, and Section 4 concludes the paper.

METHODS
The data for this paper were obtained from a study "Towards Useful Ageing - Neuroprotective model for healthy longevity among Malaysian elderly" or TUA (the Malay language word for ageing) under the Long-Term Research Grant Scheme (LRGS) programme of the Ministry of Education, Malaysia. Ethical approval for the TUA programme was obtained from the ethics committees of both Universiti Teknologi MARA (reference no: 600-RMI [5/1/6/01]) and University of Malaya Medical Center (UMMC; reference no: PPUM HU-61/12/1-1). The dataset consists of 39 people living with AD and 39 age-matched healthy controls (HC) who were recruited from the Memory and Geriatric Clinic, UMMC. The inclusion criteria for AD patients were age 65 years or older and fulfilment of all the conditions of probable AD based on the revised National Institute of Neurological and Communicative Disorders and Stroke and the Alzheimer's Disease and Related Disorders Association (NINCDS-ADRDA) criteria. Patients were diagnosed as possible or probable AD by a neurologist or geriatrician. The selection criteria were also based on the Mini-Mental State Examination (MMSE) score (less than or equal to 26). The exclusion criteria were as follows: (1) age less than 65 years old; (2) functionally independent patients as measured by the Katz basic activities of daily living and Lawton instrumental activities of daily living (IADL) scales; (3) an MMSE score of more than 27; and (4) patients could communicate and give informed consent.
The inclusion criteria for HC were age more than 65 years old and absence of any documented history of memory or other cognitive impairment, major psychiatric illnesses or mental disorders and concomitant diseases. Selected HCs were also functionally independent as measured by Katz basic activities of daily living and the Lawton instrumental activities of daily living scales, with MMSE of more than 27.
After the participants had gone through the study information sheet and given consent, 8.5 ml of blood was drawn from each of them. Within 30 minutes after venipuncture, the blood samples were centrifuged at 1050 g for 3 minutes. The resultant supernatant (serum fraction) was transferred and divided into four aliquots of 400 µl of serum. Next, 25 µl of serum was added to platinum enzyme-linked immunosorbent assay (ELISA) kits to bind and detect specific targeted cytokines. The details of the extraction process can be found in [15,25]. Thirteen cytokines were found to be relevant in AD, and they can be grouped into classical and non-classical inflammatory cytokines. The former were IL-1β, IL-6, IL-12, IFNγ, TNF-α and TGF-β, while the latter were CXCL-1, IL-8, IL-10, IP-10 (CXCL-10), MIP-1α, MCP-1 and IL-13.

Data pre-processing and preliminary analysis
The data were analysed using the R programming language, an open source software for statistical analysis [26], after the data were checked and cleaned to ensure the validity and reliability of all observations. The preliminary analyses included descriptive statistics such as frequency, percentage, mean and standard deviation to assess the distribution of the data.
The demographics of subjects, such as age, gender, ethnicity, smoking history and history of alcohol consumption, were recorded to outline their physiognomies. Binary logistic regression was employed to ascertain the association between AD and the physiognomic variables. Initially, univariable logistic regression was performed to assess the relationship of each variable with AD; then multivariable logistic regression was performed to evaluate the relationship of each variable in the presence of the other variables. The performance of this model was reported using sensitivity, specificity, accuracy, the Akaike Information Criterion (AIC) and the area under the receiver operating characteristic (ROC) curve.
To establish the predictive model of blood-based cytokine biomarkers for AD, we focused on the cytokine variables only. We started the analysis by detecting completely separated cytokine variables using boxplots and confirmed the results using the linear programming method developed by [27]. This was done to assess the suitability of Firth logistic regression for the dataset.
Then, univariable Firth logistic regression (penalised ML) was carried out to determine the significant individual biomarkers. For the univariable analysis of cytokines, the data were not split into training and test sets because of the complete separation of variables in the dataset and the small sample size. Only cytokines with a p-value less than 0.25 [20] were then entered into the multivariable Firth logistic regression. Forward selection, backward elimination and stepwise selection for Firth logistic regression were applied for feature selection.
However, since some covariates were highly correlated, which would lead to multicollinearity, we performed principal component analysis (PCA) to group the correlated cytokines together. Bartlett's test of sphericity was used to test whether the correlation matrix is an identity matrix, and then a PCA using a varimax rotation was carried out.
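To make the univariable step concrete, the following is a minimal sketch of a Firth-penalised fit for a single covariate plus an intercept. It is written in plain Python for illustration only (the study itself used R), and the function name `firth_logistic_1d`, the step cap and the toy data are all hypothetical rather than taken from the study. The toy data are completely separated, so ordinary maximum likelihood would drift towards an infinite slope, whereas the penalised fit settles on finite values.

```python
import math

def firth_logistic_1d(x, y, max_iter=200, tol=1e-8, max_step=5.0):
    """Firth-penalised fit of logit P(y=1) = b0 + b1*x (illustrative sketch)."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]
        # Fisher information I = X'WX for the design matrix [1, x]
        a = sum(w)
        b = sum(wi * xi for wi, xi in zip(w, x))
        d = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = a * d - b * b
        # Leverages h_i: diagonal of W^(1/2) X I^(-1) X' W^(1/2)
        h = [wi * (d - 2.0 * b * xi + a * xi * xi) / det for wi, xi in zip(w, x)]
        # Firth-modified residuals: y - p + h * (1/2 - p)
        r = [yi - pi + hi * (0.5 - pi) for yi, pi, hi in zip(y, p, h)]
        u0 = sum(r)
        u1 = sum(ri * xi for ri, xi in zip(r, x))
        # Scoring step I^(-1) U*, capped in size to keep the iteration stable
        s0 = (d * u0 - b * u1) / det
        s1 = (a * u1 - b * u0) / det
        biggest = max(abs(s0), abs(s1))
        if biggest > max_step:
            s0, s1 = s0 * max_step / biggest, s1 * max_step / biggest
        b0, b1 = b0 + s0, b1 + s1
        if biggest < tol:
            break
    return b0, b1

# Hypothetical toy data: every y=0 value lies below every y=1 value
x_toy = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_toy = [0, 0, 0, 1, 1, 1]
b0_hat, b1_hat = firth_logistic_1d(x_toy, y_toy)
print("intercept:", b0_hat, "slope:", b1_hat)
```

In practice a dedicated implementation would also supply penalised likelihood ratio tests and confidence intervals, which this sketch leaves out.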
At the final stage, we fitted the multivariable Firth logistic regression with the principal components. The performance of the final model was measured with the Hosmer-Lemeshow test, sensitivity, specificity, AIC and area under the ROC curve. We also compared the classification performance with different numbers of components.

RESULTS AND DISCUSSIONS
Table 2 shows the frequency distribution of the physiognomic variables for HC and AD and the univariable logistic regression results. Gender, ethnicity and smoking status were not significantly associated with AD status. However, history of alcohol consumption and age were significantly associated with AD (p-value < 0.05).
Multivariable logistic regression was employed to determine the relationship of physiognomic factors (gender, age and history of alcohol consumption) with AD using the forward and backward likelihood ratio (LR) selection methods. Only Age and Alcohol were selected as significant predictors. The Hosmer-Lemeshow test showed no evidence of lack of fit for the model with these two covariates (Chi-Square (8 df) = 7.9134, p-value: 0.442). The Akaike Information Criterion (AIC) was 75.55, and the accuracy and sensitivity were 80.77% and 82.05%, respectively. The area under the ROC curve was 0.87.
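The Hosmer-Lemeshow p-value quoted above is the upper-tail probability of a chi-square distribution with 8 degrees of freedom. As a quick check, the sketch below recomputes it in plain Python using the closed form that holds for even degrees of freedom; the helper name `chi2_sf_even_df` is hypothetical, and the analysis in the study was carried out in R rather than with this code.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    # Sum of (x/2)^k / k! for k = 0 .. df/2 - 1
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

# Reported Hosmer-Lemeshow statistic for the Age + Alcohol model
p_value = chi2_sf_even_df(7.9134, 8)
print(round(p_value, 3))  # close to the reported p-value of 0.442
```

A large p-value here means the test finds no evidence of lack of fit; it does not by itself show that the model is the best one available.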
In Table 3, the odds ratio for age (OR = 1.29) indicates that for every one-year increase in age, the odds of having AD increase by 29% when controlling for alcohol. Additionally, those who consumed alcohol have about 3.8 times the odds of having AD (OR = 1/0.26 ≈ 3.8).
Forward and backward likelihood ratio selection methods were applied, and multicollinearity and clinically plausible interactions were checked. Model adequacy was checked using the Hosmer-Lemeshow test. Diagnostic checks, including outlier identification and influence statistics, were carried out.
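The arithmetic behind these interpretations can be written out as follows. The five-year figure is an added illustration rather than a reported result, and it assumes that the log-odds of AD are linear in age, as the fitted model does.

```latex
\mathrm{OR}_{\text{age}} = e^{\hat\beta_{\text{age}}} = 1.29
\;\Rightarrow\;
\hat\beta_{\text{age}} = \ln 1.29 \approx 0.255

\mathrm{OR}_{5\ \text{years}} = 1.29^{5} \approx 3.57,
\qquad
\mathrm{OR}_{\text{alcohol}} = \frac{1}{0.26} \approx 3.85
```

Inverting an odds ratio, as done for alcohol, only switches which group is treated as the reference category; the strength of the association is unchanged.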

Separation detection
In Figure 2, we present the five cytokines that showed complete separation between AD and HC: IL-1β, IL-6, IL-10, IL-13 and IP-10 (CXCL-10). Separation was detected using the linear programming method developed by [27], via the R package "brglm2". For each variable, intercept and coefficient values equal to zero indicate that there is no separation; if the intercept or coefficient is not equal to zero, the data are separated (either quasi-complete or complete separation). The results in Table 4 confirmed that only these five cytokines had a complete separation issue, since their intercepts and coefficients were not equal to 0 (β ≠ 0) and were infinite [27]. Since these cytokines separated AD from HC completely, we consider them potential biomarkers of AD.
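For a single continuous covariate, complete separation can also be checked by comparing the ranges of the two groups, which is what the boxplots show visually. The sketch below does only that range check in plain Python; it is not the linear programming method of [27], it does not flag quasi-complete separation, and the function name and toy values are hypothetical.

```python
def completely_separated(values, labels):
    """True if a single threshold on `values` splits the two classes perfectly."""
    group0 = [v for v, g in zip(values, labels) if g == 0]
    group1 = [v for v, g in zip(values, labels) if g == 1]
    if not group0 or not group1:
        raise ValueError("both classes must be present")
    # Separation holds when the two ranges do not touch or overlap
    return max(group0) < min(group1) or max(group1) < min(group0)

# Hypothetical cytokine concentrations (0 = HC, 1 = AD)
labels = [0, 0, 0, 1, 1, 1]
separated = completely_separated([1.2, 1.9, 2.4, 5.1, 6.3, 7.0], labels)
overlapping = completely_separated([1.2, 5.5, 2.4, 5.1, 6.3, 7.0], labels)
print(separated, overlapping)  # True False
```

With several covariates, separation can arise from a linear combination even when no single variable separates the groups, which is why a linear programming check is used on the full model.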

Fitting Firth's logistic regression
The univariable Firth logistic regression results in Table 5 show that all cytokines were significantly associated with AD except TGF-β, TNF-α, CXCL-1 and IL-8, whose Firth p-values were more than 0.05 and whose AICs were considerably higher than those of the other cytokines. Next, we fitted a multivariable Firth logistic regression model using the variables with p-value ≤ 0.25 in Table 5. The forward selection, backward elimination and stepwise selection algorithms were applied to the Firth logistic regression with nine biomarkers. Forward selection and stepwise selection retained only IL-13 in the final main effects model, whereas backward elimination retained only IL-6. This disagreement in selecting the best main effects model is due to the high correlation among the covariates (biomarkers). Table 6 shows the Pearson correlations between the covariates. IL-1β has high correlations (0.80) with IL-6, IL-13 and IP-10. Meanwhile, IL-6 has high correlations with IL-1β, IP-10 and IL-13.
Next, PCA was applied to cluster the correlated covariates for further analysis. All nine covariates that were statistically significant in Table 5 were considered for PCA. The correlation matrix was used to calculate Bartlett's test of sphericity. The result was significant [chi-squared (df): 506.53 (36)], and thus we could proceed with PCA. A PCA using a varimax rotation of all nine covariates was carried out. Covariates with a factor loading of 0.4 or higher (indicating satisfactory loading) were regarded as valid and significant contributors to a component. Only the first principal component (PC) had an eigenvalue greater than 1 (eigenvalue: 5.61), and it explained 62.32% of the variance. Based on the factor loadings, four PCs were extracted, which together explained 85.81% of the variance.
Then, a Firth logistic regression model using the four principal components was fitted. Table 8 shows that all four components were statistically significant. All the PCs were associated with increased odds of AD, since the adjusted odds ratios were larger than 1. The biggest contribution came from PC 1, which involved four covariates (IL-1β, IL-6, IL-13 and MCP-1α), followed by PCs 3, 4 and 5.
The Hosmer-Lemeshow test showed no evidence of lack of fit for the model with four PCs (Chi-Square (8 df) = 2.78, p-value: 0.9472). The AIC was 15.441, and the sensitivity and the area under the ROC curve were 97.22% and 1.00, respectively.
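The degrees of freedom and the variance share quoted for the PCA follow from standard formulas, shown below. Here n is the number of subjects, p the number of covariates, R their correlation matrix and λ1 the first eigenvalue; these symbols are assumed for illustration, and the form of Bartlett's statistic given is the usual textbook one rather than one stated in the paper.

```latex
\chi^{2} = -\left(n - 1 - \frac{2p + 5}{6}\right)\ln\lvert R\rvert,
\qquad
\mathrm{df} = \frac{p(p-1)}{2} = \frac{9 \times 8}{2} = 36

\frac{\lambda_1}{p} = \frac{5.61}{9} \approx 0.623
```

The second line agrees with the reported 62.32% up to rounding of the eigenvalue, because for a PCA on the correlation matrix the eigenvalues sum to p.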
We then compared the classification performance using different numbers of components. Firth's logistic regression model with PC1 only had an accuracy of 78.2% with sensitivity and specificity of 71.8% and 84.6% respectively. Table 9 shows the hierarchy of model performance starting from PC1 and adding all PCs in the Firth's logistic regression model. The classification performance increased for every PC added and achieved 92.3% accuracy, sensitivity, specificity and precision when we used three PCs.
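The classification measures used in this comparison all derive from the four cells of a confusion matrix. The sketch below shows the calculation in plain Python, with AD as the positive class. The counts are hypothetical: they are one set of values consistent with the percentages reported for the PC1-only model and with 39 AD and 39 HC subjects, not the study's actual confusion matrix, and the function name is likewise assumed.

```python
def classification_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, specificity and precision from confusion counts."""
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),   # true positive rate among AD
        "specificity": tn / (tn + fp),   # true negative rate among HC
        "precision": tp / (tp + fp),     # share of predicted AD that are AD
    }

# Hypothetical counts: 28 of 39 AD and 33 of 39 HC classified correctly
metrics = classification_metrics(tp=28, fn=11, tn=33, fp=6)
percentages = {name: round(100 * value, 1) for name, value in metrics.items()}
print(percentages)
# {'accuracy': 78.2, 'sensitivity': 71.8, 'specificity': 84.6, 'precision': 82.4}
```

Because the two groups are the same size, accuracy is simply the average of sensitivity and specificity in this design.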

CONCLUSION
This paper focused on the issue of complete separation of data and illustrates the use of Firth logistic regression as an alternative to the classical logistic regression model. The aim was to establish a reliable prediction model for AD cytokine biomarkers. In the presence of complete separation, classical binary logistic regression would produce infinite ML estimates of the coefficients (non-existence of an MLE for non-overlapping data) [24]. Firth logistic regression was used to investigate the relationship between cytokines and AD because it can handle complete separation. Due to the presence of multicollinearity among the biomarkers, we used principal component analysis to cluster the biomarkers. This study found that IL-1β, IL-6, IL-13 and MCP-1α were the important biomarkers. Yin [28] reported that IL-1β and homozygous APOE ε4 combined are associated with an increased hazard of developing AD. IL-1β was also reported with six accompanying pathways in Cytoscape that linked it to AD [29]. The importance of IL-6 was also supported by [30], which reported that the levels of IL-6 and IFN-γ were significantly higher in altered T-lymphocytes of AD patients compared to HC. In our study, IFN-γ was found to have a significant relationship with AD. In addition, some studies have reported that an increase in IL-6 influences the progression of cognitive decline in AD [31,32].
Furthermore, tumor necrosis factor alpha (TNF-α) was not significantly associated with AD in the univariable Firth logistic regression. This result is supported by [33], whose authors concluded that alterations in immunological conditions involving tumor necrosis factor-mediated signaling were not the primary events in initiating AD pathology, including amyloid plaque and tangle development.
Next, IL-10 (PC 4) was also found to be associated with an increased risk of AD in the multivariable Firth model (OR: 4.47). IL-10 and IL-13 are regarded as anti-inflammatory cytokines by virtue of their ability to suppress genes for pro-inflammatory cytokines [34]. These results are in line with the findings of Dayana et al. [15].
Then, the odds ratio for interferon gamma-induced protein 10 (IP-10), also known as C-X-C motif chemokine 10 (CXCL-10), in PC 3 indicated an elevated risk of AD. This result is supported by [35], whose authors stated that CXCL-10 was positively correlated with the severity of cognitive decline in AD patients. Furthermore, in animal studies, CXCL-10 has been implicated in disease progression in the APPSWE mouse. It has been demonstrated that ablation of the CXCL-10 receptor, chemokine (C-X-C motif) receptor 3, in APPSWE/PS1∆E9 mice ameliorated amyloidosis and cognitive decline [35].
In conclusion, Firth logistic regression is a useful technique for identifying significant biomarkers when there is an issue of data separation. The results of this study can be validated in a study with a larger sample size. In future work, we seek to develop an efficient prediction model for AD by combining cytokine, transcriptomic and proteomic biomarkers.