Comparison of four analytic strategies for complex survey data: a case-study of Spanish data

Background: The aim of this secondary data analysis was to investigate the effect of four different analytical strategies, Model Based Analysis (MBA), Design Based Analysis (DBA), Multilevel Model Based Analysis (MMBA) and Multilevel Design Based Analysis (MDBA), on the model estimates for complex survey data. Methods: Using data from the World Health Survey-Spain, explanatory models for the outcome, Metabolic Equivalent of Task (METs), were fitted using MBA, DBA, MMBA and MDBA. Regression coefficients, standard errors (SE) and the Akaike Information Criterion (AIC) from all the models were compared. Results: DBA gave the highest estimates for most of the variables and consistently higher SE than all other models: 20% to 48% higher than MBA, 10% to 37% higher than MMBA and 23% to 35% higher than MDBA. The SE for MDBA were 2.5% to 13% higher than those derived from MMBA for level 1 predictors, but the SE in MMBA was 18% higher for the level 2 predictor. Values of AIC suggested that the model derived by MDBA was the best fit and DBA the poorest fit of the four models. Conclusion: The MDBA appeared to be the most appropriate approach to analyse complex survey data on the basis that it had the lowest AIC. To confirm the findings of the present study, a simulation study with hypothetical data would be required.


INTRODUCTION
Large epidemiological surveys almost universally employ multistage complex sampling procedures for data collection, where clusters (or primary sampling units, PSUs) are sampled at the first stage, sub-clusters at the second stage, and so on, until final units (typically individuals) are sampled at the final stage [1,2]. The World Health Survey is an example of multistage complex sampling where districts were selected as PSUs, enumerated areas as secondary sampling units (SSUs) and households as tertiary sampling units (TSUs) [3]. Complex sampling strategies are used because they often make the process of estimation more efficient by reducing the cost of data collection for a given level of precision [4,5]. They are particularly useful in the case of a population that is geographically dispersed, where a simple random sample would entail traveling significant distances and require greater time and effort for data collection. Complex sampling approaches almost inevitably result in unequal probabilities of individual selection, giving rise to the so-called design features of complex survey data [1].
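To make the link between unequal selection probabilities and design weights concrete, the following minimal Python sketch computes an inverse-probability weight for respondents selected through three stages and compares the unweighted and weighted means. The stage probabilities and outcome values are hypothetical illustrations and are not taken from the WHS.

```python
# Hypothetical three-stage selection: PSU, household within PSU,
# respondent within household. All numbers are illustrative only.

def design_weight(stage_probs):
    """Inverse of the overall selection probability across all stages."""
    overall = 1.0
    for p in stage_probs:
        overall *= p
    return 1.0 / overall

def weighted_mean(values, weights):
    """Mean of values with each value weighted by its design weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Each respondent: (outcome value, selection probability at each stage)
respondents = [
    (10.0, (0.10, 0.20, 0.50)),  # overall probability 0.010
    (20.0, (0.10, 0.20, 0.25)),  # overall probability 0.005
    (40.0, (0.05, 0.20, 0.50)),  # overall probability 0.005
]

values = [y for y, _ in respondents]
weights = [design_weight(probs) for _, probs in respondents]

unweighted = sum(values) / len(values)
weighted = weighted_mean(values, weights)
print(round(unweighted, 2), round(weighted, 2))
```

Respondents with a lower probability of selection represent more people in the population, so they receive a larger weight and pull the weighted mean towards their values.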
A complex sampling strategy, however, also imposes a multilevel or hierarchical structure on the data. For instance, in the World Health Survey described above, a multilevel structure is present whereby individuals are embedded within a household, households are embedded within an enumerated area and enumerated areas within a district. Data from such complex surveys are therefore the product of both an underlying multilevel structure and the design features. If a multilevel structure has been imposed on the data through the sampling strategy, then that multilevel structure can itself become the focus of research. For example, multilevel analysis was used in a study of health care expenditure in which the authors estimated the simultaneous effects of individual-level and cluster-level characteristics on maternal health care spending [6,7]. The rise of interest in multilevel analyses of hierarchically structured data introduces another dimension to consider in the analysis of complex survey data.
The combination of design features and multilevel data suggests four possible approaches to the analysis of data from a complex survey design. The first approach is to analyse the data as if they were a simple random sample from the population, ignoring both the design features and the multilevel structure of the data. This analysis can be termed a "model based analysis" (MBA) [8], for example the application of ordinary least squares regression (e.g., [9]). The second approach is to take account of the design features and the clustering in the data, while still treating all predictors as if they were measured at the lowest level: a "design based analysis" (DBA) (e.g., [10]). This would involve using the weighted sample to provide unbiased estimates for the independent variables in the regression model [11-13]. The third approach would be to ignore the design features but instead focus on the multilevel nature of the data, allowing interpretation of individual and area level effects on individual outcomes using multilevel analysis (e.g., [14]). Such an approach would explain variation in the dependent variable at one level as a function of variables defined at other levels, plus interactions within and between levels [11]; this could be described as a "multilevel model based analysis" (MMBA). Like its non-multilevel counterpart, this model based analysis may lead to biased estimates when employed in samples that include design features in the data [15]. Finally, the fourth approach is an analysis in which both the design features and the multilevel nature of the data are taken into account: a "multilevel design based analysis" (MDBA) (e.g., [16] and [17]).
Previous research has studied the effect on model estimates of ignoring the design features of data from surveys employing complex sampling strategies [12,18,19] and of ignoring multilevel structure in multilevel data [11]. However, there has been no systematic comparison of the effect of the four different modeling strategies (MBA, DBA, MMBA and MDBA) on the model estimates when the data are collected using a complex survey design. Therefore, this study investigates the effect of the MBA, DBA, MMBA and MDBA analytic strategies on model estimates from the Spanish World Health Survey (WHS) data.

METHODS
This study comprised the secondary analysis of a publicly available data set. Model estimates derived from four analytical strategies were compared: MBA, DBA, MMBA and MDBA.

Data source
The World Health Survey (WHS) is a large cross-sectional survey that was administered in 70 countries between 2002 and 2003 to assess healthcare expenditure, adult mortality, birth history, risk factors, chronic health conditions and the coverage of health interventions [20]. The WHS adopted several steps to ensure standardisation and comparability across diverse sites and times, including extensive interviewer training, standardised measurement tools and techniques, an identical questionnaire and instrument pretesting. The WHS sampling frame covered 100% of Spain's eligible population, and no ethnic groups or geographic areas were excluded from it. The target population included any adult, male or female, aged 18 years or older living in a private household, who was not out of the country during the survey period. The WHS used a multistage stratified design in most countries, including Spain, with each elementary unit having a defined probability of selection [20]. WHS data are made freely available by the World Health Organization for secondary analysis by the research community.
The multistage sampling in the WHS used statistical enumeration areas as the Primary Sampling Units (PSUs). These were identified in the WHO sampling documentation as naturally occurring groupings with clear, non-overlapping boundaries [21,22]. The strata chosen varied by country and reflected local conditions. Some examples of the factors that were used for stratification were geography (e.g. North, Central, South), level of urbanization (e.g. urban, rural), socio-economic zones, provinces (especially if health administration is primarily under the jurisdiction of provincial authorities), or presence of health facilities in the area. The PSUs were used as a clustering or grouping variable in the current analysis. In the WHS, stratification was done at the first stage of the sampling. Once the strata were chosen and justified, all stages of sample selection were conducted separately in each stratum. More detailed information on the sampling approach can be found elsewhere [23].
Data from Spain were selected for this study based on the sample size (n=6364) and the number of PSUs (997). The Spanish WHS also had an extremely high response rate (95.5%) compared with other WHS countries. After excluding cases with missing data, the final sample for analysis was 6079 individuals.

Variables
The outcome variable used in this study was a measure of physical activity per week in units of Metabolic Equivalent of Task (METs). One MET is defined as the energy spent sitting quietly (equivalent to 4.184 kJ per kilogram of body mass per hour) [24]. To assess physical activity in the WHS, respondents were asked to report the number of days during the last week on which they engaged in vigorous and moderate walking and the duration of such activities. Taking the different intensities and durations of the activities into account, a measure of energy expenditure per individual was estimated [25]. METs were selected for this analysis specifically because of the high intra-class correlation for this variable (ICC=0.23).
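As a rough illustration of why an ICC of this size matters, the sketch below applies Kish's approximate design effect for cluster sampling, 1 + (m - 1) x ICC, where m is the average cluster size. This approximation is not part of the analysis reported here: it assumes equal cluster sizes, ignores weighting and stratification, and assumes that all 997 PSUs remain in the analysed sample of 6079 individuals.

```python
def kish_design_effect(avg_cluster_size, icc):
    """Approximate design effect for cluster sampling: 1 + (m - 1) * ICC."""
    return 1.0 + (avg_cluster_size - 1.0) * icc

n_individuals = 6079  # analysed sample size reported in the text
n_psus = 997          # PSUs reported in the text (assumed unchanged after exclusions)
icc = 0.23            # intra-class correlation reported for METs

avg_cluster_size = n_individuals / n_psus
deff = kish_design_effect(avg_cluster_size, icc)
effective_n = n_individuals / deff  # size of an equally precise simple random sample

print(round(avg_cluster_size, 2), round(deff, 2), round(effective_n))
```

Under these assumptions the design effect is a little above 2, meaning that variances computed as if the data were a simple random sample would be understated by roughly that factor.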

Explanatory variables
Age, sex, education, occupation, fruit and vegetable intake, body mass index (BMI), household income and setting (urban/rural) were the key explanatory variables. They were selected on the basis of factors identified in previously published research exploring predictors of activity as assessed by METs [26-28].
Age was measured in years of life. Education was measured in number of years of schooling. Occupation was a categorical variable distinguishing "employed", "housewife", "retired" and "not working". BMI, defined as mass in kilograms divided by height in meters squared, was based on self-reported height and weight. The WHS did not contain a comprehensive nutrition survey measuring whole diets, but rather sought measurement of fruit and vegetable intake only. Two questions were used: "How many servings of fruit do you eat on a typical day?" and "How many servings of vegetables do you eat on a typical day?" [29]. Household wealth was defined in terms of ownership of material possessions, with each individual assigned a wealth score on the basis of ownership of a range of household goods. Factor analytic procedures were used to provide a wealth score for each household, and households were then divided into quintiles of wealth. The urban-rural nature of the PSU was provided in the WHS dataset based on local definitions. The urban/rural classification of the PSU was used as the area level (level 2) predictor in the multilevel analyses. Table 1 describes the explanatory variables used in all four models.

Analytical strategies
Four analytical strategies were developed. The first model was a model based analysis (MBA), which assumed the data were drawn as a simple random sample from the population. All predictors were treated as individual level attributes and no account was taken of the design features or the clustering of the data. The second model was a design based analysis (DBA), which took account of the design features of the data while still treating all the predictors as individual level attributes. The estimation was based on inverse probability weighting and design based standard errors. The third model was a multilevel model based analysis (MMBA). In this third model all the predictors were treated as level 1 predictors except the urban-rural predictor, which was treated as a level 2 (i.e., PSU) predictor. Design features were not applied to the data; however, the multilevel nature of the data was taken into account, with individuals clustered within PSUs.
The fourth model was a multilevel design based analysis (MDBA). The analysis took account of clustering as well as the design features. All analyses were conducted using the Mplus statistical package [30]. DBA was performed using the command "Analysis Type = COMPLEX" with input of the "CLUSTERING", "STRATIFICATION" and "WEIGHTS" variables. MMBA was performed using the command "Analysis Type = TWOLEVEL" with input of the "CLUSTERING" variable. MDBA was performed using the command "Analysis Type = COMPLEX TWOLEVEL" with input of the "CLUSTERING", "STRATIFICATION" and "WEIGHTS" variables and urban or rural setting at level 2. Regression estimates and standard errors for all four models were compared. The AIC (Akaike Information Criterion) was also calculated for each of the four models to measure relative goodness of fit and to identify the best fitting model.
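For readers unfamiliar with the AIC, the comparison can be sketched as follows, using the standard definition AIC = 2k - 2 ln(L), where k is the number of free parameters and L the maximised likelihood. The model labels, log-likelihoods and parameter counts below are hypothetical placeholders chosen only to show the arithmetic; they are not the values in Table 3.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical (log-likelihood, number of free parameters) pairs.
# Labels are placeholders and the values are NOT taken from Table 3.
fits = {
    "model_1": (-1520.4, 12),
    "model_2": (-1498.7, 14),
    "model_3": (-1501.2, 13),
}

aic_values = {name: aic(ll, k) for name, (ll, k) in fits.items()}
ranking = sorted(aic_values, key=aic_values.get)  # lowest AIC first
best = ranking[0]
print(ranking)
```

A lower AIC indicates a better trade-off between fit and complexity, so the first entry in the ranking is the preferred model among those compared.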

RESULTS
The descriptive statistics (weighted and unweighted) for the outcome variable (METs) and each of the explanatory variables are shown in Table 2.
Table 3 summarises the four models of METs using the 8 predictors. The parameter estimates, standard errors and level of significance for each model are shown, as well as the AIC.
The average age of the sample under the model based (unweighted) analysis was about 2 years older than under the design based (weighted) analysis; the model based analysis also reported about 8% fewer METs. There was little difference in the estimated BMI. Differences may similarly be observed in years of school, the percentage in each occupation category and the percentage in each wealth quintile. The highest difference of nearly 6% was seen in urban or rural settings, average BMI and level of fruit and vegetable intake.
There were both consistencies and variations in the results of the analyses across the four models. At a superficial level, predictors that were identified as statistically significant in one model were, with few exceptions, identified as statistically significant in the other models. Gender, for instance, had a significant effect across the four models, and urban-rural setting did not have a significant effect in any model. Education level was an exception: it was statistically significant in all models except the MBA.
The regression estimates did not show a consistent pattern across the models. DBA gave the highest estimates among the four models for most of the variables but showed the lowest estimates for gender. That is, the effect of being male was about 32% less in the DBA than it was for the MMBA or MDBA. The estimates for the effect of BMI were lower for the MMBA and MDBA models and higher for the MBA and DBA models. The lowest variation between the models (15%) was seen in age. The urban-rural variable was not significant and showed extreme variation in the estimates.
For the significant effects of gender, occupation and fruit-vegetable intake, the estimates from the two multilevel models (MMBA and MDBA) were more consistent with each other than they were with the single level models (MBA and DBA). For the age and years of education estimates, however, the two design based analyses were more consistent with each other than they were with the model based analyses. For the estimate of gender, the model based and design based (non-multilevel) analyses were reasonably consistent.
Some interesting patterns were observed in standard error values across the four models. As one would expect, DBA estimates showed consistently higher standard errors than the other models: 20% to 48% higher than the MBA, 10% to 37% higher than the MMBA and 23% to 35% higher than the MDBA. On the other hand, MDBA had consistently higher standard errors, by 2.5% to 13%, in comparison to the MMBA model for level 1 predictors, but the standard error in the MMBA model was higher by 18% for the level 2 predictor. The MDBA also had higher standard errors when compared to the MBA model, but the variation was comparatively smaller (from 3.7% to 12.5%). By contrast, the MDBA model had a standard error that was 6.8% lower for the occupation category 'housewife'. The AIC value was lowest for the MDBA model and highest for the DBA.
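The percentage comparisons of standard errors reported above are assumed here to be relative differences of the form 100 x (SE_a - SE_b) / SE_b. A minimal sketch with hypothetical standard errors (not values from Table 3) shows the calculation:

```python
def pct_higher(se_a, se_b):
    """Percentage by which se_a exceeds se_b (negative if se_a is lower)."""
    return 100.0 * (se_a - se_b) / se_b

# Hypothetical standard errors for one coefficient under two strategies.
se_strategy_a = 0.60
se_strategy_b = 0.50

diff = pct_higher(se_strategy_a, se_strategy_b)
print(round(diff, 1))
```

Note that the result depends on which strategy is used as the reference in the denominator, so "A is 20% higher than B" does not imply "B is 20% lower than A".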

DISCUSSION
Four possible methods of data analysis for complex survey data (MBA, DBA, MMBA and MDBA) were compared. On the basis of model fit (AIC), the analyses rank, from best to poorest fit: MDBA, MMBA, MBA and DBA.
Data collection incorporating multi-stage sampling and design features is common in large epidemiological surveys. In the past it was relatively difficult to take account of the complex survey design in the data analysis, but recent advances in statistical software have made design based analysis accessible. Notwithstanding the availability of the software, judgments still need to be made about the best approach to take with the data. The results of this analysis raise important questions about how researchers should approach data derived from surveys with complex sampling designs. Although the model-based methods have gained popularity over the design-based methods because they can be readily implemented using standard commercial software, there is a consensus among statisticians that a straightforward MBA is inappropriate, since the common observation is that such an approach underestimates the uncertainty of the estimate. The analysis here, however, suggests that the estimates themselves can vary substantially in magnitude, although not in significance. The standard errors are, unsurprisingly, higher for the design-based approach for all the explanatory variables. This generally did not affect the statistical significance of the results, except in one case where the MBA estimate was non-significant.
In general the multilevel models tended to show greater agreement with each other than with the other models. Most of the estimates for MDBA were closer to those for MMBA than to those of the other models; moreover, the AIC of the MDBA was closest to that of the MMBA model. The MMBA explicitly models the clustered nature of the data, which should narrow the standard errors and eventually increase inferential accuracy. However, incorporation of design features in MMBA increases the standard error. Thus combining both design features and multilevel modeling leads to a standard error estimate that falls in between DBA and MMBA. Regarding the reliability of the estimates, the standard errors are lower with the MMBA for all of the variables. Therefore, the design-based analysis estimates are, overall, more precise than those from the model-based analysis.
It could be argued that, because the significance of variables in the regression models appears to be largely invariant, more complex analytical methods offer little over a simple MBA. Indeed, traditionally the availability of software and computing power to conduct complex analyses was limited. If an analytic strategy is imperfect (as they all are) but yields the same general interpretation, does it matter whether strategy A or strategy B is selected? However, we have seen that the accuracy of such estimates will vary with the analytical method chosen, which may be important in reducing overall error.
To confirm the findings of the present study, future work could be performed on simulated data, where hypothetical population data fitting a multilevel model could be created. A sample could then be drawn from this hypothetical population using complex survey sampling designs to compare the regression parameters derived from the four analytical approaches.
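One possible shape for such a simulation is sketched below in Python. It is an illustration under stated assumptions rather than a proposed protocol: the population follows a hypothetical random-intercept model, y_ij = b0 + b1 x_ij + u_j + e_ij, with arbitrary parameter values; clusters are selected by simple random sampling and a fixed number of units is taken per cluster, which yields unequal selection probabilities. Only unweighted and weighted single-level slopes are compared; fitting the two multilevel variants would require a mixed-model routine that is outside the scope of this sketch.

```python
import random

random.seed(2024)

# Hypothetical population: 200 clusters, each with its own random intercept.
# All parameter values are illustrative assumptions.
B0, B1, SD_U, SD_E = 10.0, 2.0, 3.0, 5.0
population = []
for j in range(200):
    size = random.randint(20, 200)
    u_j = random.gauss(0.0, SD_U)  # cluster-level random effect
    units = []
    for _ in range(size):
        x = random.gauss(0.0, 1.0)
        y = B0 + B1 * x + u_j + random.gauss(0.0, SD_E)
        units.append((x, y))
    population.append(units)

def draw_two_stage_sample(pop, n_clusters=40, n_per_cluster=10):
    """Stage 1: simple random sample of clusters.
    Stage 2: simple random sample of a fixed number of units per cluster.
    Returns (x, y, weight, cluster_id) tuples, weight = 1 / selection probability."""
    sample = []
    p_cluster = n_clusters / len(pop)
    for j in random.sample(range(len(pop)), n_clusters):
        units = pop[j]
        take = min(n_per_cluster, len(units))
        p_unit = take / len(units)
        w = 1.0 / (p_cluster * p_unit)
        for x, y in random.sample(units, take):
            sample.append((x, y, w, j))
    return sample

def ls_slope(sample, use_weights):
    """Slope of y on x by least squares, optionally weighted by design weights."""
    ws = [w if use_weights else 1.0 for _, _, w, _ in sample]
    sw = sum(ws)
    mx = sum(w * s[0] for w, s in zip(ws, sample)) / sw
    my = sum(w * s[1] for w, s in zip(ws, sample)) / sw
    sxy = sum(w * (s[0] - mx) * (s[1] - my) for w, s in zip(ws, sample))
    sxx = sum(w * (s[0] - mx) ** 2 for w, s in zip(ws, sample))
    return sxy / sxx

sample = draw_two_stage_sample(population)
slope_unweighted = ls_slope(sample, use_weights=False)
slope_weighted = ls_slope(sample, use_weights=True)
print(len(sample), round(slope_unweighted, 2), round(slope_weighted, 2))
```

Repeating the draw many times and summarising the estimates against the known population slope would give the bias and variability of each strategy, which is the comparison that real survey data cannot provide.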

CONCLUSION
The four analytic strategies for complex survey data provide substantially different model estimates, standard errors and AIC. The lowest AIC was derived from the Multilevel Design Based Analysis, which therefore appears to be the most appropriate approach to analyse complex survey data.
Previous research has studied separately the effect of complex survey design and hierarchical data structures on model estimates.
This study provides a systematic comparison of the combined effect of including consideration of the complex survey design and the hierarchical structure of the data on model estimates when the data are collected using a complex survey design with multilevel elements.
One limitation of this study is that the data were derived from a real life survey. Simulated data could yield a more accurate comparison of the methods, by generating a hypothetical population fitting the multilevel model.

TABLE 1.
Summary of the statistical methods and variables used in the analyses

TABLE 2.
Unweighted and weighted descriptive statistics of outcome and explanatory variables.

TABLE 3.
Multivariate linear regression analysis showing MET association with various micro and macro level explanatory independent variables, with and without consideration of sampling design.