Choosing the right strategy to model longitudinal count data in Epidemiology : An application with CD 4 cell counts

BACKGROUND: Statistical models for analysis of correlated count data are important for answering epidemiological issues that involve taking individual count measurements repeatedly over time through the use of longitudinal studies. Conventional regression models for this type of data are inadequate and can lead to inappropriate conclusions and inference. Longitudinal studies in Public Health involve evaluation and monitoring of patients with infectious diseases, such as HIV/AIDS, to assess their health status, to check the effectiveness of the treatment, and to make prognosis about the evolution of the disease, including interdependencies of clinical manifestations. The purpose of this article is to describe various statistical strategies for the analysis of longitudinal count data with emphasis on how to choose the most suitable model for the data and in the interpretation of the results. METHODS: We illustrate the applicability of various statistical strategies by evaluating the effect of associated factors on lymphocyte CD4+T cell count in HIV seropositive patients in Salvador, Bahia, Brazil. We describe the Poisson and Negative Binomial models using the multilevel (ML) approach and the generalized estimations equations (GEE) for the analysis of longitudinal count data. RESULTS: The interpretations of the results derived from ML and GEE differ and thus their direct comparison should be avoided. CONCLUSION: We believe that the statistical methods for the analysis of longitudinal studies with correlated count data can be useful to address several important issues in public health, especially in helping to monitor patients and checking the effectiveness of treatments.


INTRODUCTION
The use of statistical models for the analysis of correlated count data has grown in public health research.In fact, a recent review shows that from a selection of 108 articles in the medical field between 2000 and 2012 using generalized linear mixed models (GLMMs) 20.4% considered models for count data [1].The most common models for count data include the Poisson and Negative Binomial (NB) distributions.Christofides  collaborators, for example, used the Poisson random effects model to predict the incidence of pregnancy due to sexual violence [2] and a hospital surveillance study used the NB model to identify the associations between year, seasons and rate of infections to evaluate infection by Clostridium difficile [3].
Longitudinal Studies (LS) are crucial in dealing with epidemiological issues as they trace the behavior of the response variable (outcome) over time, investigating the effect of covariates on the profile of the outcome, making predictions, and assessing global or individual changes in the response over time [4].In biostatistics these studies are known as cohort studies, whereas in areas such as sociology, economics and business, they are referred to as panel studies.Data from LS have unique characteristics, including their temporal ordering and dependence between consecutive measurements [4].
Past literature has focused mainly on the description of models for correlated data of continuous or binary responses for applied researchers, referring particularly to its characterization and application using linear or logistic models [5][6][7][8][9].The theoretical developments of Poisson and NB models have led to methodological extensions for analysis of longitudinal data in the recent past.These lastest innovations are currently available in statistical software.Nevertheless, they have not been widely used and disseminated by applied researchers, which is partially due to their technical complexity.At the same time, there is a consensus about the inappropriacy of using conventional regression models to fit correlated data, which would provide incorrect standard errors and, consequently, could lead to misleading inference and conclusions [4][5][6][7][8][9].Therefore, there is a need for a unified framework to present and describe these methods making them easily accessible to researchers in the field of Public Health.This paper aims to present distinct modeling strategies for the analysis of longitudinal count data, explaining their use and limitations so as to promote a better understanding of the usefulness of these tools in answering scientific questions in Epidemiology.To illustrate this, we analyze data on the number of CD4+T lymphocytes repeatedly measured in HIV seropositive individuals in Salvador, Bahia, Brazil.
HIV/AIDS remains a global challenge and a major public health problem.The World Health Organization estimates that so far around 25 million men, women, and children have died from AIDS worldwide [10].In Brazil, 544,846 AIDS cases were reported from the beginning of the epidemic until 2009.In the city of Salvador, the capital of the State Bahia in the Northeast of Brazil, 2,944 new cases were registered between 2000 and 2008.In the context of this epidemic, the assessment of the number of CD4+ T lymphocytes over time is important to monitor the history of HIV infection and its consequent progression to AIDS.The CD4+ T cell counts and the quantification of the viral load have been used both in the indication and evaluation of the need for modification of antiretroviral regimens [5].Given the magnitude of this epidemic and the methodological challenges to implementing a more effective and robust analysis, we characterize different statistical strategies for analyzing longitudinal count data, illustrating their applicability and interpretation through the evaluation of the effect of factors associated with CD4+ counts in HIV patients.

METHODS
Data.The data refers to 587 HIVseropositive patients in the city of Salvador, who were registered on the Laboratory Testing Control System (SISCEL, in Portuguese) of the Brazilian Ministry of Health between January 2002 and August 2012.This system was developed with the purpose of monitoring the laboratory procedures of lymphocyte T CD4/CD8 cell counting and to perform the quantification of HIV viral load both for treatment indication and for monitoring patients undergoing antiretroviral therapy.
The CD4+ cell counts have great intra and inter variability, specifically when values are above 200 cells/mm³, hindering its identification in the early stages of the infection.So far there is no objective ideal value of the number of CD4+ cells to specify the beginning of the antiretroviral treatment for all patients because the rate of disease progression can vary widely among individuals.The viral load (VL) is defined as the number of virus copies in 1 milliliter of blood.Initial results in untreated patients can reach up to 1 million e 1 1 5 2 0 -2 or more copies/ml.During treatment, high VL is between 5,000 and 10,000 copies/ml.A low VL (between 40 and 500 copies/ml) indicates slow disease progression [6].Given the usual limits for the CD4+ count and VL, individuals who had CD4+ above 1,500 cells/mm³, either at baseline or during follow-up periods, and those with a VL greater than 1 million copies/mm at baseline were excluded from our analysis.
Statistical modeling considered the number of CD4+ cells as the outcome, which was measured at different points in time after receiving the antiretroviral therapy provided by the government program.The covariates include patient's gender (0 = male, 1 = female) and the following information at baseline: age (in years); a dummy variable treatment (0 = he/she was not in treatment before registration at SISCEL, 1 = he/she was in treatment); the categorized CD4+ in baseline (0 if CD4+ <350 cells/mm³, and 1 if CD4+ ≥ 350 cells/mm³) and categorized VL (0 if VL <500 copies/ml, 1 if VL between 500 (inclusive) and 5,000 copies/ml, and 2 if VL ≥ 5,000 copies/ml).The time variable, in years, was also included in the model to indicate when the CD4+ count was taken after registration at SISCEL.Furthermore, the individuals do not have the same number of repeated measurements (number of observations per patient ranged from 1 to 24) and the measurements were taken at different time points, i.e. the study is unbalanced and unequally spaced.

Statistical Models
Regression Models for Count Data.Count data are quite common in epidemiological studies.This type of data assumes only nonnegative integer values (i.e.0, 1, 2, ...) and is usually modeled using the Poisson distribution, which is characterized by having equal mean and variance of the response variable (Y i ), hence, E(Y i ) = Var (Y i ) = μ i .However, when overdispersion is present, i.e. when E (Y i ) < Var (Y i ), the Poisson model is no longer appropriate.In such situations, the Negative Binomial model can be used, and is denoted by , where α controls for the overdispersion [11].To illustrate the use of these models, consider our CD4+ data.Let Y i be the number of CD4+ cells recorded in the ith row of the dataset (i=1, 2, …, 8,072).Assuming that the observations are independent, the data can be described by a Poisson or by a NB model to evaluate the effects of the covariates on the CD4+ counts, which can be defined as: where μ i denotes the mean number of CD4+ for the ith individual and 8,072 is the total number of observations, which refers to repeated measurements of 587 individuals in this study.The main difference between the Poisson and the NB models is the additional parameter and, consequently, the specification of the likelihood functions associated with them.The parameter estimation can be achieved via likelihood maximization by using a nonlinear optimization procedure [12].
Note that the traditional regression models for counting responses assume that the observations are independent.However, when clustered or longitudinal designs are used this assumption is no longer reasonable [13].
Models for Longitudinal Data.The models for longitudinal data are required when there are repeated measurements of the outcome for the same individual over time, which leads to a dependence structure in the data.The two approaches commonly used to analyze longitudinal data are the conditional and the marginal models [14].One of the most important conditional models for longitudinal data is the linear mixed or multilevel model, in which the coefficients have an individual or cluster-specific interpretation.This model is conditional on random effects that describe the behavior of a response that varies for a specific individual.In marginal models on the other hand, the dependent variable (outcome) is modeled separately from the correlation between the measurements of each sample unit (denoted as intra-unit or intra-individual correlation).Consider a generic notation, where m individuals that are followed-up may have n i repeated measures which can vary between individuals, and consider that index i denotes individuals and j indicates the observations.Using generalized estimating equations (GEE) as a marginal strategy, the expected marginal mean, E (Y ij ) i=1,...,m and j=1,...,n i , is modeled as a function of the explanatory variables [4] with via the definition of the "working" correlation matrices, which are shown in Table 1.
Multilevel Approach.The simplest multilevel model for count data considers a single random intercept effect that represents the differences between the individuals.Let X t ij = (X ij2 , X ij3 ,...,X ij(p-1) ) be the covariate matrix, t ij the time when the jth measure of the ith individual was taken, β= (β 2 , β 3 , ..., β (p-1) ) T , and b oi ∼ N (0, t o ) assumed to be independent of X t ij .Then, the linear predictor is given by: The model parameters are estimated by maximum likelihood using iterative methods such as the Fisher Scoring or the Newton-Raphson [13].
In many practical situations, it is reasonable to assume not only an average per individual, but also that the effects on repeated measurements of the response are dependent on random effects.Therefore more complex multilevel models that include two random effects can be suitable.For instance, the random coefficients given in (2) can be defined as: y 0i =β 0 +β 0i and Y 1i = β i +b 1i , where the random vector b i =[b 0i b 1i ] follows a bivariate normal distribution with mean 0 and covariance matrix .Again, upon Poisson or NB distributional assumptions for the response variable the parameters are estimated iteratively using the Newton-Raphson algorithm to maximize the likelihood function.The random effects can in theory take any probability distribution.However, for ease of computation, control and robustness of inferential processes the statistical packages restrict its use to particular cases.
Model selection is based on consistent Akaike information criterion (CAIC) defined as CAIC = -2L + p[log(mn)+1], where L is the loglikelihood function, n is the average number of repeated measurements, and p is the number of parameters [15].The model with the lowest CAIC is chosen.
Marginal Approach.GEE are extensions of GLMM's for correlated data and require only the correct specification of univariate marginal distributions provided one is willing to adopt a "working" correlation matrix [11].The linear predictor is specified as Independent: Assume that correlations for distinct measurements of the same individual are zero.This form is not adequate for longitudinal studies because their data are generally highly correlated.
Exchangeable: Assume that correlations between all repeated measurements of the same individuals are equal.
Autoregressive of order 1 (AR1): Assume that adjacent correlations are greater in magnitude.The intra-individual correlation over time is an exponential function of its length.For longitudinal data, this is the most parsimonious correlation structure because it depends on one single parameter and yet it enables the correlations to diminish over time.

Unstructured:
All n(n-1)/2 correlations of R i are estimated.This structure is more efficient and useful when there are only few time points.When there are several repeated measurements, the estimation of this structure is very complicated.Besides that, missing data makes it difficult to estimate R i .vector of fixed parameters associated with the covariate vector Z T ij= (1,t ij ,X ij2 ,X ij3 ,...,X ij(p-1) ).A link function that relates the marginal mean to the linear predictor is specified.In the case of the Poisson and NB distributions the canonical link function is the logarithm (log), i.e., μ ij =exp (Z T ij β * ).In this approach, the variance is written as a function of the mean [13].The estimates of β are obtained by the Newton-Raphson iterative method.Model selection is carried out based on the criterion of quasilikelihood under the independence model (QIC) [16].This criterion compares models with different correlation structures, such that the smallest QIC identifies the best model [17].

WORKING CORRELATION MATRICES COMMONLY USED IN MARGINAL MODELS CONSIDERING AN EXAMPLE
Computational Support.We used software R version 3.2.0[18] and Stata [19] version 10 to implement the methods described here.Details on syntax are presented in the Appendix.

RESULTS
In this study 63% of HIV seropositive individuals were males, 90% were under treatment at baseline and the average age of patients was 38 years (1 to 83 years).The mean follow-up was 4.6 years (3 months to 10.6 years).At baseline, 59% of patients had CD4 counts below 350 cells/mm³ and 64% had VL above 5,000 copies/ml.Overdispersion was detected in the data, indicating NB as the most appropriate model.However, for comparison and illustration of the methods described here, we present results for the NB and Poisson models.
The individual profiles graph for 10 randomly chosen patients is shown in Figure 1.Analyzing information displayed in Figure 1 we can gain insights regarding the variability between individual units at a given point in time; the variance within units over time; and the trends over time.Note that the space between the lines represents between unit variability and the change in each line (slope) represents within variability.We observe a wide variability in the number of CD4 and in the number of repeated measurements.
The relative risk estimates using GEE-Poisson and NB models, with different correlation structures, are presented in Table 3.According to the QIC, the best marginal model to fit this data is the NB with exchangeable correlation structure.It can be observed that patients with CD4 + counts above 350 cells/mm³ at baseline had a mean number of CD4 cells which was 43% greater than those with counts below that (RR=1.427;95%CI=1.326-1.552).Those patients who were undergoing treatment at baseline had an average number of CD4+ 38.0% greater than patients who were not undergoing treatment, controlling for the other covariates in the model (RR=1.379;95%CI=1.172-1.614).It is important to highlight that the interpretation of the estimates from Poisson and NB models are similar when using the same modelling strategy (marginal or conditional).
Figure 2 presents the estimated trajectories for the average number of CD4+ in accordance with the estimates obtained by GEE-NB model with exchangeable correlation structure for four patients with specific profiles for a period of 10 years.The first patient was not receiving treatment at baseline, male, 70 years old, with CD4 + at baseline below 350 cells/mm³ and VL above 5,000 copies/ml (Patient 1).The second patient also was not undergoing treatment at baseline, female, 35, CD4+ at baseline above 350 cells/mm³ and VL between 500 copies/ ml and 5,000 copies/ml (Patient 2).The third patient was undergoing treatment, male, 50 years old, baseline CD4+ below 350 cells/ mm³ and VL between 500 copies/ml and 5,000 copies/ml (Patient 3).The fourth patient was undergoing treatment, female, 20, CD4+ at baseline above 350 cells/mm³ and VL less than 500 copies/ml (Patient 4).The prediction equation is given as: log(μ i ) = 5.6173 + 0.0325 x time + 0.3189 x treatment + 0.0476 x gender -0.0009 x age + 0.3608 x CD4_baselina + 0-0126 x VL_dummy1 + 0.0041 x VL_dummy2 As expected, all four individuals had increasing average numbers of CD4+ cells over time according to their predicted individual profiles (Figure 2).Patient 1 had the worst performance.Interestingly, even though patient 2 was not undergoing treatment at baseline, she performed better than patient 3.This is due to other characteristics of these individuals,    It is worth noting that the interpretation of the results from marginal and conditional models differs.Although both approaches model the average number of CD4+ cells, the marginal model has a population-average interpretation while the multilevel model, being conditional on random effects, provides an individual interpretation.Therefore, the results from these models should not be directly compared.

RELATIVE RISK ESTIMATES FOR ANALYSIS OF LONGITUDINAL CD4+ CELL COUNTS IN HIV-SEROPOSITIVE
An important step in fitting a regression model is the verification of possible departures from the assumptions of the model.This diagnostic analysis is usually performed through residual analysis.However, due to the complexity of our data structure (unbalanced and unevenly spaced) there were no diagnostic methods available in R or Stata.Thus, their implementation represents a challenge for future work.

DISCUSSION
Analysis of longitudinal data using conventional regression models is inadequate as they fail to consider the dependence between observations over time.Longitudinal data may also present additional complexities in its structure, which may occur due to the imbalance and/or the fact that they are unevenly spaced, or owing to missing data.It is up to the data analyst to conduct a thorough exploratory analysis to evaluate the data structure and choose the statistical model that best suits it.For the analysis of count data in particular, the two most widely used statistical models are the Poisson and NB models.The choice depends on characteristics inherent to the data, for example the NB model is appropriate when overdispersion is suspected [20,21].The parameter estimates based on NB are not very different from those based on the Poisson model.However, the Poisson regression underestimates the standard errors when overdispersion is present, leading to inappropriate inference.A simple way to choose between these two models is to compare them based on some criteria, such as AIC, CAIC or QIC, depending on the adopted modeling strategy.Another way is to estimate the scale parameter from NB and to test the null hypothesis that it is equal to zero.
The choice of the modeling strategy The multilevel model, on the other hand, deals with the regression coefficients and the intrasubject correlation simultaneously in a single equation, in which the response is modeled as a function of the covariates and random effects [14].These methods are complex and some of them are still under development.
To date we know of no implementation of regression diagnostic methods for very complex data structures as the described in our application.It should also be noted that the distribution associated to the random effects varies according to the statistical software.Despite the limitations and methodological complexity, the use of LS with counts as responses is important in epidemiological studies, as is the case of our application.Other authors have monitored and evaluated the natural history of HIV using repeated measurements of CD4+, which were analyzed by using the multilevel linear model.To consider this type of modeling, an alternative embodiment is to use the CD4+ percentage as the outcome rather than a count [22] or using a transformation to the cell count (for example, square root of the number of cells) [23] so that the assumptions of normality and homoscedasticity of the errors are fulfilled.We implemented these strategies on our data but we found evidence of violation of the model assumptions.An additional limitation of using linear mixed models for longitudinal count data is that they do not enable the estimation of measures of association such as the relative risk.In addition to allowing the specification of a dependency structure between multiple measurements on the same individual, one of the advantages of the models for longitudinal count data described in this article is the possibility of including both fixed covariates, such as race or sex, and time-dependent covariates, such as type of treatment regimen or number of infections in the last quarter.
The methods described in this work enable the description of the impact of several factors on lymphocyte CD4+ counts in HIV-seropositive patients using all available information.We believe that this type of analysis can be useful to address several important issues in public health as well as help in monitoring patients and checking the effectiveness of their treatments.

APPENDIX: COMPUTATIONAL SYNTAX
The R software refers to a language and an integrated development environment for statistical calculations and graphics, being freely distributed and available at www.r-project.org 21.Stata software is a statistical program that was developed in C and released in 1985 22 .The most current version of Stata is 14.There are available versions of R and Stata for Windows, Macintosh, Linux and Unix.

Software R
To fit the multilevel Poisson models one can used the lme4 library through the glmer function.The Negative Binomial multilevel models are implemented using the glmmADMB library through the function glmmadmb.The syntax is similar to the previous one, substituting the name of the function and the argument concerning the distribution to: family="nbinom".
For the random intercept Poisson and NB models implemented in R, b 0i follows a Normal distribution.For multilevel models with two random effects the joint distribution of random effects b 0i and b 1i is considered to be bivariate normal in the software R.
To fit GEE using the Poisson distribution, the geepack library can be used along with geeglm function.The syntax considering the autoregressive (AR1) correlation structure is: geeglm(CD4 ~ time + treat + gender + age + CD4_baseline + factor (VL_baseline), data = database, family = poisson, id = id, corstr = "ar1") For other correlation structures one should only change the argument corstr for "exchangeable" or "unstructured".To date this function does not support the fit of NB models.

Software Stata
It is possible to fit the multilevel Poisson and NB models using xtpoisson and xtnbreg commands, respectively.The default distribution for the random intercept for multilevel Poisson model is the Gamma distribution, but one can alter that to normal distribution using the normal argument.The syntax for fitting the corresponding models is: xi Public Health -2015, Volume 12, Number 4 M O D E L L I N G L O N G I T U D I N A L C O U N T D A T A . The dependence structure among repeated measurements of the same individual are dealt e Public Health -2015, Volume 12, Number 4 Public Health -2015, Volume 12, Number 4 M O D E L L I N G L O N G I T U D I N A L C O U N T D A T A

FIGURE 1 INDIVIDUAL
FIGURE 1 INDIVIDUAL PROFILES FOR CD4+ COUNT IN 10 RANDOMLY SELECTED INDIVIDUALS.SALVADOR-BA, 2002-2012 INDIVIDUALS USING MULTILEVEL POISSON AND NEGATIVE BINOMIAL MODELS.SALVADOR-BAHIA.2002-2012 Public Health -2015, Volume 12, Number 4 M O D E L L I N G L O N G I T U D I N A L C O U N T D A T A RELATIVE RISK ESTIMATES FOR ANALYSIS OF LONGITUDINAL CD4+ CELL COUNTS IN HIV-SEROPOSITIVEINDIVIDUALS USING GEE-POISSON AND NEGATIVE BINOMIAL MODELS.SALVADOR-BAHIA.2002-2012 e 1 1 5 2 0 -7 especially for their CD4+ counts at baseline.

ACKNOWLEDGEMENTS:
This work was supported in part by Coordenação de Aperfeiçoamento de Pessoal do Nível Superior (CAPES)/Ministry of Education-Brazil (scholarship to D.B.T.), Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq) (grant 478556/2010-1 to L.D.A and grant 470810/2013-0 to R.O.) and Fundação de Amparo à Ciência e Tecnologia de Pernambuco (FACEPE) (grant APQ-0682-1.02/14).We are grateful to the Department of STD, AIDS and Viral Hepatitis of the Brazilian Health Ministry for allowing the use of their anonymous data, and particularly to Silvano B. Oliveira for his assistance during the project.CONFLICT OF INTEREST: none declared.E L L I N G L O N G I T U D I N A L C O U N T D A T A Public Health -2015, Volume 12, Number 4 M O D E L L I N G L O N G I T U D I N A L C O U N T D A T Axi: xtnbreg CD4 time treat gender age CD4_baseline i.VL_baseline, nolog pa corr(ar1) i(id) Other options for corr argument are exch, ind and unstr.The QIC can be calculated after installation of the function QIC.do.The syntax associated with the GEE-AR1 Poisson model, for example, is: qic CD4 time treat gender age CD4_baseline i.VL_baseline, nolog eform fam(poisson) corr(ar1) i(id)
Particularly to fit the random intercept Poisson multilevel model it is necessary to consider the argument (1 | id).The syntax is: On the other hand, for fitting the multilevel model with two random effects, being associated to the intercept and time variable, it is necessary to use the argument (time | id).In this case, the syntax is: glmer(CD4 ~ time + treat + gender + age + CD4_baseline + factor (VL_baseline)+ (time|id), data= database, family=poisson) : xtpoisson CD4 time treat gender age CD4_baseline i.VL_baseline, nolog re i(id) xi: xtpoisson CD4 time treat gender age CD4_baseline i.VL_baseline, nolog normal re i(id) For multilevel NB models, the default distribution of the random intercept is Beta.The syntax is given by: xi: xtnbreg CD4 time treat gender age CD4_baseline i.VL_baseline, nolog re i(id)Regarding the GEE approach, the functions used to fit the NB and Poisson models are, respectively, xtgee and xtnbreg.The syntax using AR1 correlation structure is: xi: xtgee CD4 time treat gender age CD4_baseline i.VL_baseline, nolog fam(poisson) corr(ar1) i(id)