A note on prognostic accuracy evaluation of regression models applied to longitudinal autocorrelated binary data

Background: Focus of this work was on evaluating the prognostic accuracy of two approaches for modelling binary longitudinal outcomes, a generalized Estimating Equation (gEE) and a likelihood based method, Marginalized Transition Model (MTM), in which a transition model is combined with a marginal generalized linear model describing the average response as a function of measured predictors. METhods: a retrospective study on cardiovascular patients and a prospective study on sciatic pain were used to evaluate discrimination by computing the area under the receiver-operatingcharacteristics curve, (auc), the Integrated discrimination Improvement (IdI) and the net reclassification Improvement (nrI) at different time occasions. calibration was also evaluated. a simulation study was run in order to compare model’s performance in a context of a perfect knowledge of the data generating mechanism. rEsulTs: similar regression coefficients estimates and comparable calibration were obtained; an higher discrimination level for MTM was observed. no significant differences in calibration and MsE (Mean square Error) emerged in the simulation study; MTM higher discrimination level was confirmed. conclusIons: The choice of the regression approach should depend on the scientific question being addressed: whether the overall population-average and calibration are the objectives of interest, or the subject-specific patterns and discrimination. Moreover, some recently proposed discrimination indices are useful in evaluating predictive accuracy also in a context of longitudinal studies.


InTroducTIon
Both in retrospective and prospective observational studies, repeated measurements are often taken over time to evaluate longitudinal dynamics of an outcome and factors affecting this evolution.In some instances, outcomes are binary indicators of presence of a particular condition.Statistical methods for analysis of longitudinal binary data have been rapidly developed during recent years; Diggle et al. provide the detailed review [1].Either subject-specific and populationaverage estimates could be a major goal in this context.The data that motivated this manuscript come from two different research fields: a cardiovascular (CV) cohort of 503 idiopathic dilated cardiomyopathy (DCM) patients, enrolled from 1988 to 2007 in the Heart Muscle Disease Registry of Trieste, a database of a tertiary referral center on cardiomyopathies [2] and a database from an occupational medicine department (OM).The latter consisted in a prospective cohort study on 537 male professional drivers conducted over a two-year period (2004)(2005)(2006) to investigate the relation of sciatic pain to measures of internal spinal load [3].In CV patients, timing of implanted cardioverter defibrillator (ICD) for primary prevention of sudden death (SD) is, to this day, not clearly established: waiting some time could reduce useless implants but also increase the exposure to SD, and no guidelines are available.General criteria for ICD eligibility in DCM patients are Left Ventricular Ejection Fraction (LVEF) ≤0.35 and NYHA (New York Heart Association) functional classes II or III ("SCD-HeFT criteria": Sudden Cardiac Death Heart Failure Trial, [4]).The aim of this retrospective cohort study was to explore the longitudinal evolution of the presence/absence of SCD-HeFT criteria after therapy optimization in order to identify the optimal waiting time before implantation.In the OM prospective study, measures of daily whole body vibration (WBV) exposure and spinal load were derived from the biodynamic and epidemiological databases implemented by the German and Italian arms of the EU VIBRISKS project [5][6].The main goal was on determining the boundary values for the internal lumbar load to prevent the occurrence of sciatic pain from a public-health intervention point of view.
The performance of the two following regression approaches were compared on the above datasets: a Generalized Estimating Equation, GEE method [7], and a likelihood based method, Marginalized Transition Models, MTM [8].The former, adopts a semiparametric model in which only the marginal mean and the correlation of repeated measurements are specified.In the latter, a transition model that characterizes serial dependence is combined with a marginal generalized linear model that describes the average response as a function of measured predictors.Models' performances were assessed by calibration and discrimination.Calibration addresses the question of how closely the model-based risk estimates align with the observed outcomes.Discrimination focuses on a model's ability to distinguish between subjects who will (or did) develop the event of interest from those who will (did) not.A simulation study was also run, in order to compare performance of the above regression models in a context of a perfect knowledge of the true data generating mechanism.

METhods real data modelling
We observe serial binary response data on subjects at times .In CV data, it Y is a binary indicator of the SCD-HeFT condition at time t for subject i; in OM data, it Y is a binary indicator of sciatic pain at time t for subject i.We also observe a set of constant or time-varying covariates, respectively and recorded for each subject at baseline and at each occasion.In CV data, a constant covariate indicates the initial condition of the subject, 'InitialCond', coded as an ordinal variable: 1=No SCD-HeFT at baseline, 2= SCD-HeFT at baseline, 3=NYHA IV at baseline, based on the increasing severity of the starting condition, and a time effect ('Time') that assumes values from 0 to 48 (time scale is in months), depending on the number of observations per subject.In OM data, a constant covariate indicates age at entry in the study ('Age'); time-varying continuous covariates are: years of exposure at "whole-body vibration" ('WBV') in decades and measures of internal spinal load as the daily compressive dose 'Sed' (in MPa), and a risk factor 'R' (non-dimensional units) that measures the adverse health effects related to the compressive dose, both measured at six lumbar spine levels [3,9].There are also time-varying categorical covariates that represents individual and work-related risk factors: physical work load ('Posture', with four levels: mild, moderate, hard, very hard) and psychosocial work environment ('Psycho', with four levels: good, acceptable, a little bit bad, bad); finally, the time effect ('Time', with three occasions corresponding to surveys).
To obtain estimates for the regression of Y it on covariates we assume that the regression model properly specifies the full covariate conditional mean defined as Since the outcome is a binary condition, the marginal generalized linear model specifies: g(μ it )=X it β+Z i λ where g() is the logit link function and the regression coefficients β,λ measure the influence of covariates on the average response.We have to take into account the withinsubject correlation that naturally arise from repeated observations on the same patients.For CV data, the following two regression models were estimated: (CV.a) : (CV.b): Two regression models were estimated for OM data, in order to compare the two different measures of spinal load available: (OM.a): (OM.b): In model (OM.b) covariates Age and WBW were excluded since they are both present in the computation of the risk factor R L5S1 .The choice of the spinal levels L5-S1 for the measures of internal spinal load was based on a preliminary analysis among all six spinal levels that indicated Sed L5S1 and R L5S1 as the most representative of the extremely correlated group of spinal levels linearly affecting the outcome.Each of the above models was estimated with the two different regression approaches: the GEE method [7,10] where specification of a working correlation structure is required to take account of the correlation and main choices are among independence, exchangeable, autoregressive or stationary.In the present study, the autoregressive of order one (AR1) working correlation structure was adopted, in analogy with the second regression approach, MTM.These are a general parametric class of serial dependence models that permit likelihood based marginal regression analysis of binary response data [1,8] where covariate effects and within-subject association are modelled through a single equation: given the immediate previous response in a first-order Markov model [11] the current response variable is assumed to be conditionally independent of any previous outcome variables .The probabilities that define the first order Markov process are given by , i.e. the probability to observe an outcome at time t if the outcome condition was absent or present in the previous follow up.The first-order MTM adopted in the present study is specified by assuming a regression structure for the marginal mean, using the above defined generalized linear model g(μ it ), and this model is constrained by the transition probabilities to satisfy .A parameter of serial dependence of Y it on Y it 1 is estimated that measures the log odds to have the outcome at time t among subjects who had the outcome at time t-1 compared to subjects who had not: α= log(OR(Y it , Y i, t-1 )) where OR stands for odds ratio.R packages 'geepack' [12] and 'mtm' were used to fit the models.

Prognostic accuracy evaluation
The relative behaviour of GEE with respect to likelihood based models had already been exploited in various context of study design and inference objectives, with the help of simulation studies and real data applications [13][14]; but specific considerations about discrimination and calibration have not been particularly explored in the literature.To evaluate discrimination, i.e. the model's ability to distinguish between subjects who will develop the event or condition of interest from those who will not, three discrimination indices were computed.First of all, the Area Under the Receiver-Operating-Characteristics curve, AUC: as it is well known [15], AUC represents the probability that given two subjects, one who will develop an event and the other who will not, the model will assign a higher probability of an event to the former.To take into account the longitudinal structure of the data, AUCs were computed separately at the different times for the two approaches, where Model x =GEE or MTM, t=time: where is the probability of event estimated by the model for subject i at time t.AUCs values range from 0.50 (useless) to 1 (perfect discrimination).The De Long test was calculated to evaluate differences among ROC curves [16].
The recently introduced [17] Net Reclassification Index (NRI) and the Integrated Discrimination Improvement (IDI) were also computed following the longitudinal structure of the data.NRI evaluates the relative increase in the predicted probabilities for subjects who experience the event and the decrease for subjects who do not.NRI is the sum of two components, 'NRI for events' and 'NRI for non events': if we indicate the MTM model-based vector of predicted probabilities of event at time t by t MTM and the GEE model-based vector of probabilities by t

GEE
, NRI at time t was estimated as: where the first term ('NRI for events') quantifies improvement in sensitivity and the negative of the second term ('NRI for non events') quantifies improvement in specificity.Heuristic benchmarks to interpreting NRI values are: NRI around 0.20 indicates a weak improvement; around 0.40 a medium improvement, and greater than 0.60 a strong improvement [18].
IDI is the difference in discrimination slopes, i.e. the difference between mean predicted probabilities of an event for those with events and the corresponding mean for those without events; thus, it jointly quantifies the overall improvement in sensitivity and specificity, and it can be estimated at time t as: where is the mean predicted probability of Model x (GEE or MTM) at time t in the group of subject with events, and is the mean predicted probability in the group of subject without events.Values of IDI around 0.006 indicates a weak improvement, around 0.04 a medium improvement and around 0.10 a strong improvement [18].For NRI and IDI asymptotic tests for the null hypothesis of no difference between models can be applied, and 95% confidence intervals have been computed.'Calibration at large' was also evaluated, calculating means of the predicted risks by the two approaches at different times while comparing them with the observed outcomes.

simulation study
In order to complement the illustration from the case studies we also ran a brief simulation.We fixed regression parameters β and generated three covariates.The first, associated with a negative slope, recorded time from baseline measurement, the second was a time-varying indicator generated from a fair Bernoulli trial, and the last one a time-fixed continuous covariate generated from a standard Gaussian.In order to sample the outcome we adopted the approach described in [19], which is based on the marginal means arising from the above described fixed effects and covariates.A firstorder autocorrelation (AR1) structure was used.Hence, we did not generate data from any of the two models afterwards used for fitting.We fixed T=4 time occasions and repeated data generation, model estimation and evaluation of the results for different values of the sample size n (250,500,1000), and correlation coefficient rho (0.1,0.25,0.33,0.5).The results are averaged over 1000 replicates.

CV data
At diagnosis and at each follow-up study, population was divided into two groups by the binary condition "SCD-HeFT criteria" yes or no.The first four evaluations were considered, on average at 6 (range 3-9), 12 (9-18), 24 (18-36) and 48 (36-60) months after diagnosis.Patients with all measures were defined as "Complete Cases" (CC).Patients with some follow-up visits but not others were defined as "Intermittent Drop-out" (ID); 'Drop-out at Death' (DD) were those fully observed until death, and finally "Baseline Drop-out" (BD) were those that drop-out after baseline, but were known to be alive after 48 months.CC group counted 165 (33%) patients; ID were 221 (44%); 47 patients (9%) were DD, and 70 (14%) were BD.Independently from the number of patients included (from all patients until 'CC') the rate of temporal decline in the log-odds of SCD-HeFT estimated by GEE and MTM models by using (CV.a) was comprised between -0.02 to -0.03 per time increment (odds ratio between 0.97 and 0.98 per month, nearly 0.70-0.76 in 12 months).Following (CV.b),PROGNOSTIC ACCuRACy IN AuTOCORRELATEd REGRESSION mOdELS the initial condition was highly relevant in increasing the probability to be or not SCD-HeFT at successive follow-up (log-odds from 1.5 to 2.4, depending on the subgroups used, table 1).The MTM serial dependence of first-order was always significant and strong: the log odds to have SCD-HeFT at time t among subjects who had SCD-HeFT at time t-1 with respect to subjects who had not was estimated between 1.7 and 2.2.The AR1 within-subject correlation estimated by GEE was also significant and strong for both models (between 0.49 and 0.51).Considering all patients, the observed prevalence of SCD-HeFT at baseline was 52% , that sharply decreased at 23% at 6 months, and remained quite stable until 48 months (respectively at 12 and 24 months: 19% and 18%, and 20% at 48 months); the others sub-groups of patients showed very similar trends (Supplementary Figure 1(a)).Calibration at large, expressed as estimates of prevalence at different times, was biased towards lower values at baseline and higher values at 6-12 months, both for models (CV.a) and (CV.b) and for both regression approaches; more precise estimates at the end of follow up were obtained, at 24 and 48 months.In particular, at baseline the estimated prevalence was 41% for GEE and 42% for MTM (both in models CV.a and CV.b); it decreased respectively at 34% and 28% at 12 months and at 17% and 13% at 48 months, showing an higher precision of GEE at the end of follow up and lower in the middle (Figure 1 (a)).

OM data
Characteristics of the study population have been already extensively described in various papers [4,21].Briefly, the cohort included male professional drivers (n=628) employed in several industries and public utilities located in various Provinces of Italy.The rate of participation in the initial crosssectional survey was 95.2% (n=598); five hundred and thirty-seven responders participated in the successive follow up surveys over the calendar periods 2004-2006.Causes for loss to the follow-up were change of residence (n=15), consent refusal (n=36), and undetermined reasons (n=10).Not all 537 subjects participated in all the three surveys: 80% in the first, 89% in the second and 83% in the third; all subjects had at least two measures, but only 59% of them had also a third measure, for a total of 1391 observations.Moreover, some subjects had missing values for several covariates included in the regression models, therefore the total number of observations included in the multivariable models was 1359.At the first survey, the mean age at entry was 41 (sd=8), the average years of exposure at "wholebody vibration" ('WBV') in decades was 1.7 (sd=1), the daily compressive dose Sed L5S1 averaged at 0.29 (sd=0.1)MPa, and the risk factor at 0.27 (0.13) units.Physical work load ('Posture') was "hard" for the 20% of subjects and "very hard" for the 17%; psychosocial work environment ('Psycho') was "bad" for 15% of subjects.All these time-varying factors remained quite stable during successive surveys, except for an increase observed in the percentage of the "very hard" physical work load (increased at 27-28% in the second and third survey).
For model (OM.a), regression coefficients of GEE and MTM were very similar: an OR of 1.3 for a unit increase of WBW was estimated, postures 'hard' and 'very hard' were significantly affecting the outcome with respect to the level reference 'mild', as well as psychosocial work environment defined as 'bad' showed an OR between 1.5 and 1.6 with respect to the level reference 'good'.For a change of 0. PROGNOSTIC ACCuRACy IN AuTOCORRELATEd REGRESSION mOdELS estimated OR was 1.2.Also for model (OM.b), regression estimates for GEE and MTM were nearly equivalents: time become borderline significant, with an OR of 1.12, postures 'hard' and 'very hard' were still significantly affecting the outcome with respect to the level reference, instead the psychosocial work environment was no longer significant.For a change of 0.1 units in R L5SL the estimated OR was 1.3.The AR1 within-subject correlation estimated by GEE was always significant, at a level of 0.5.The MTM serial dependence of first-order pointed strongly to serial dependence: the log odds to have sciatic pain at time t among subjects who had sciatic pain at time t-1 with respect to subjects who had not was estimated at 2.2 (table 2).
At the first survey, the observed percentage of subjects with sciatic pain was 22.5% and it increased at 27.5% in the second survey and at 27.7% in the third (Supplementary Figure 1 (b)).Both for models (OM.a) and (OM.b) calibration of GEE and MTM was satisfactory: mean predicted probabilities were SCREENING IN THE GENOmIC ERA respectively 23% and 23% at the 1 st , 26.2% and 26.1% at the 2 nd , 27.9% and 28.7% at the 3 rd survey (Figure 1 (b)).Both methods showed a similar trend for the estimated prevalence, over-estimating at the 1 st and 3 rd occasion and slightly under-estimating at the 2 nd survey.An higher precision of GEE was observed in the 3 rd survey, at the end of the study.

CV data
Considering all patients, the discrimination level of MTM was always significantly superior to GEE: AUCs for MTM computed at different follow-up times ranged from 0.65 to 0.80 following (CV.a) and nearly the same following (CV.b).Of note, for (CV.a),AUCs computation are useless for GEE, since GEE predicted probabilities have the same value for all patients, being 'Time' the only covariate in the regression model (and it is at a fixed value computing AUCs separately at different follow-up times).For (CV.b), AUCs values for GEE were always around 0.65, significantly lower with respect to MTM (De Long test p values always <=0.001, except at 6 months).NRI values showed more or less the same trend of AUCs: starting with 'lower' (but high in the NRI interpretation scale) values at 6 months (around 0.60 for both CV.a and CV.b), then   1).IDI showed a different behaviour in the two regression models: in case of (CV.a),IDI values were higher than in case of (CV.b), indicating that an improvement both in sensibility and in specificity was achieved by GEE using (CV.b), even if showing significantly lower values with respect to MTM (figure 2 and supplementary table 1).Results for the other groups of patients were in line with 'all population' (supplementary table 1).

OM data
Except for the first survey, where no significant differences were observed, the discrimination level of MTM was always significantly superior to GEE: AUCs for MTM computed at the 2 nd and 3 survey ranged from 0.76 to 0.78, following (OM.a) and (OM.b), with respect to the GEE AUCs range of 0.64-0.67(figure 3 and supplementary table 2, De Long p<0.001).NRI values were always significant and showed the same trend both for model (OM.a) and (OM.b): a nearly linear increase between the first and the last survey (figure 4 and supplementary table 2).The time trend for IDI values was even more impressive, starting with low values at the first survey (0.002-0.003) and increasing at 0.140-0.200at successive surveys (figure 4 and supplementary table 2).No relevant differences between models (OM.a) and (OM.b) were observed, indicating a similar role of the two different measures of internal spinal load available.

simulation results
With respect to the estimated MSE (Mean Square Error) and in line with expectations, a decreasing trend was observed for increasing sample size.The two methods yielded approximately  3).
In Figure 5 we report results corresponding to n=500 for visual illustration.Both Table 3 and Figure 5 show that when the autocorrelation is low (rho=0.1),MTM is nearly equivalent to GEE.As the autocorrelation grows, the predictive accuracy of MTM gets better and better than GEE in terms of AUCs, NRI and IDI.A stable pattern is seen across time occasions, where on the first occasion the two methods perform similarly, and then they tend to lead to different predictions at later times.Calibration results were nearly the same for the two methods.

dIscussIon
Analysis of longitudinal data is challenging due to the need to address correlated outcomes, missing data and time-varying exposure processes.In this paper we focused on the ways to evaluate prognostic  SCREENING IN THE GENOmIC ERA accuracy and discrimination of two regression methods.The former, GEE, is quasi-likelihood based in that it only requires specification of the mean and variance, not the entire probability distribution [7,10].The latter, MTM, allows for simultaneous likelihood-based estimation of the average response and for the serial dependence among longitudinal observations.The latter model is permissible both when subjects have varying lengths of follow-up and when data may be missing at random [14], as it could be considered the case in the OM cohort.Instead, due to the presence of 'non-ignorable' drop outs due to deaths in CV data, a sensitivity analysis considering all different group of patients was performed, and no relevant differences in estimates and accuracy across groups between the two methods emerged.
Estimates of the regression coefficients were similar for the two approaches, both on real case studies and in the MSE values derived from the simulation.However, evident differences were observed in discrimination, determined by the better characterization at the individual-level of a serial dependence that compared a person's potential outcomes under different paths of interest in MTM.Calibration at large was globally similar between the two models, particularly suffering in the CV data for the suboptimal linear modelling of the time effect, and showing in both case studies a higher precision of the GEE approach at the end of the longitudinal path.No significant differences in calibration between the two methods emerged from the simulation study.
Results of the present study confirm that the practical choice of the regression approach should depend on the scientific question being addressed: in CV data the primary goal was in the individual estimation of the right timing of ICD implantation; a "retrospective" scenario was considered in which longitudinal data were available: by looking at transitions between couple of time points an important improvement in the first 6-12 months and then a stabilization during the subsequent period was observed, even though a little, clinically negligible, worsening towards the 4th follow-up PROGNOSTIC ACCuRACy IN AuTOCORRELATEd REGRESSION mOdELS year was present, possibly due to the natural progression of the disease.The important role of the initial condition together with a not ignorable serial dependence across repeated measurements both emerged as strong individual predictors of longitudinal evolution.In this case, the MTM approach should be preferred with respect to GEE.For OM data, more than subject-specific, the cross-sectional population-average estimates lend themselves to an important interpretation: regulatory authorities would generally be mainly interested in determining the boundary values for the internal lumbar load to prevent the occurrence of sciatic pain that must show efficacy for the average population under study, and in this respect the GEE approach could be safely used.
Last but not least, the three measures used to evaluate model's performance, the Area Under the Curve (AUC), the Integrated Discrimination Improvement (IDI) and the Net Reclassification Improvement (NRI), in its continuous version, offer complementary information to evaluate discrimination and have been here extended, for the first time at our knowledge, to the case of longitudinal binary regression models.
Epidemiology Biostatistics and Public Health -2014, Volume 11, Number 4 SCREENING IN THE GENOmIC ERA Public Health -2014, Volume 11, Number 4 PROGNOSTIC ACCuRACy IN AuTOCORRELATEd REGRESSION mOdELS Epidemiology Biostatistics and Public Health -2014, Volume 11, Number 4 SCREENING IN THE GENOmIC ERA

sUpplEmEntaRY FIGURE 1 (
a): obsERvEd pREvalEncE oF scd-HEFt cRItERIa, cv data.(b) obsERvEd pREvalEncE oF scIatIc paIn, om data.

FIGURE 2
FIGURE 2 Top-down: AUCs (solid lines) And Corresponding 95% Ci (dAshed lines); in red And TriAngles MTM vAlUes, in blACk And doTs gee vAlUes, for Cv.A And Cv.b. nri vAlUes (solid lines) And Corresponding 95% Ci (dAshed lines) for Cv.A And Cv.b. idi vAlUes (solid lines) And Corresponding 95% Ci (dAshed lines) for Cv.A And Cv.b.
Public Health -2014, Volume 11, Number 4 PROGNOSTIC ACCuRACy IN AuTOCORRELATEd REGRESSION mOdELS increasing and peaking between 12 and 24 months (from 1.18 to 1.23 respectively for CV.a and CV.b) and then decreasing at 48 months (from 0.91 to 0.62, CV.a and CV.b).NRI indices were always significant, except at 48 months for model (CV.b) (figure 2 and supplementary table

FIGURE 4 nRI
FIGURE 4 FIGURE 5 1 MPa in Sed L5S1 the

tablE 2 om
data: GEE and mtm REsUlts: EstImatE (sE).In ItalIc not sIGnIFIcant EstImatEs