- Research Article
- Open Access
- Open Peer Review
Application of the varying coefficient model to the behaviour risk factor surveillance data in Italy: a study of changing smoking prevalence among sub-populations
- Shireen Assaf^{1, 2}Email author and
- Stefano Campostrini^{3}
https://doi.org/10.1186/s12889-015-1805-3
© Assaf and Campostrini; licensee BioMed Central. 2015
- Received: 18 September 2014
- Accepted: 27 April 2015
- Published: 13 May 2015
Abstract
Background
Behaviour risk factor surveillance (BRFS) data can be an important source of information for studying changes in various health outcomes and risk factors. Results obtained from surveillance data analysis are vital for informing health policy interventions, particularly with regards to evolutionary aspects. The objective of this analysis was to recommend a method that can be used for analysing trends in the association among variables from large public health data sets. This was demonstrated by examining the changing effects of various covariates, representing different sub-populations, on smoking status over time.
Methods
In our work, we propose the use of varying coefficient models (VCM) with non-parametric techniques to catch the dynamics of the evolutionary processes under study. This is a useful method, which allows coefficients to vary with time using smooth functions. Italian BRFS data from 2008-2012 was used with a sample size of 185,619 observations. In the application, a time VCM is fit for a smoking status binary outcome variable using the P-spline estimation method. The model includes ten independent variables comprising socio-demographic, health risk and behaviour variables.
Results
The VCM fit for the data indicates that the coefficients for some of the categories for the age and the alcohol consumption variables varied with time. The main results show that Italians aged 18-29 and 40-49 had higher odds of being smokers compared to those aged 60-69; however, these odds significantly decreased in the period 2008-2012. In addition, those who do not drink had lower odds for being a smoker compared to high risk drinkers and these odds decreased further during the observation period.
Conclusion
The application of the VCM to the BRFS data in Italy has shown that this method can be useful in detecting which sub-populations require interventions. Although the results have shown a decrease in the odds of being a smoker for certain age groups and non-drinkers, other sub-populations have not decreased their odds and health inequalities remain. This observation indicates that efforts and interventions are still required to target these non-changing sub-populations in order to modify their smoking behaviour.
Keywords
- P-spline
- Big data
- PASSI
- Odds ratio plots
- Health promotion policy and evaluation
Background
Public health surveillance systems have a long history, starting with the collection of data for monitoring infectious diseases to the study of non-communicable diseases in more recent years [1]. The data collected can vary from one country to another, where some collect detailed health risk and behaviour information related to non-communicable disease such as the United States of America Behavioural Risk Factor Surveillance System (US BRFSS), and the similarly designed Progressi delle Aziende Sanitarie per la Salute (PASSI) Italian surveillance system. The common characteristic of all surveillance systems is to have a continuous data collection system [2] usually by taking a new and independent random sample at each time period and therefore individuals are not followed as in longitudinal studies. Some surveillance systems collect data more frequently than others do. The US BRFSS system, for example, has been collecting data since 1984, where a new random sample is taken each month for a telephone interview [3]. The Italian PASSI surveillance system, initiated in 2007, also collects data on a monthly basis and is ongoing. Therefore, these types of surveillance systems are producing data that are continuously increasing in sample size. Surveillance systems can be considered to be another form of big data which is proving to be an important source of information, particularly when statistical well-thought methods [4] can take advantage of the availability of this rich data, which is very informative especially for evolutionary analysis.
More specifically, the analysis of Behaviour risk factor surveillance (BRFS) data can be very useful for examining trends of health outcomes and risk factors since these data are collected for long periods of time in an almost continuous fashion. Although often a simple time trend analysis answers many possible questions (e.g. is smoking increasing or decreasing?), sometimes the substantial questions are much more complex and ask for more sophisticated models. This is usually the case when wanting to answer the question “why is there a change?” or examining how the relationships among some variables are changing over time. The richness of BRFS data could offer such information [5], although, so far, rarely has it been properly addressed with sufficiently advanced statistical approaches. In the traditional trend analysis methods, such as in time series analysis, observations are assumed to be dependent and need to be aggregated before analysis, for instance to analyse changes over time in means or proportions. The aggregation of data can cause some loss of information on the variability between the observations, since it typically uses only monthly means. The assumption of dependence requires the model to be reformulated in order to take into account the dependence structure in the data. However, in BRFS data the samples are taken randomly each month and therefore the observations can be assumed to be independent. In addition, the usual methods used for the analysis of trends assume that each variable has one parameter to be estimated for the entire period of observation; therefore assuming that the parameters are constant with time. The proposed method of using varying coefficient models (VCM) for analysis of surveillance data allows one variable to vary by another therefore creating coefficients that vary with the modifying variable, which can be time. The method also does not require the aggregation of observations and does not require the assumption of dependence between the observations. The VCM can provide a flexible model for studying the dynamic and evolutionary BRFS data, and can be used to answer a different type of question in trend analysis. This question is not specifically concerned with studying the trends of the outcome, but in studying the trends in the effects (i.e. the coefficients), which can give us a deeper understanding of the changes in the subgroups of the population with respect to the outcome of interest. These models are well established in the literature and their use for the study of trends has been demonstrated by many authors, however with the use of longitudinal data [6-11]. While Young et al. [12] used the US BRFSS surveillance data as well as other environmental data for fitting a spatial varying coefficient model for the study of the spatial variation of the effect of ozone levels on myocardial infraction occurrence. However, to the best of our knowledge, the use of varying coefficient models for studying time varying coefficients using behaviour risk factor surveillance data, or similar health related big data, has not yet been performed. Therefore, there is potential to utilize this data even further which will be demonstrated in this analysis.
The aim of our study is to evaluate the applicability of the varying coefficient model approach to BRFS data. First to examine its feasibility from a computational point of view due to the growing sample size of these data collection systems, but more importantly, its relative capability of offering possibly more informative and readable results. The method was applied to studying smoking status in Italy as the outcome of interest.
Data and methods
Description of the data
The data used for the analysis is from the PASSI (Progressi delle Aziende Sanitarie per la Salute or Progress in the Italian Local Health Units) surveillance system in Italy [13], the details of which are described by Baldissera et al. [14], as well as Minardi et al. [15] and Binkin et al. [16], who have used the PASSI data for analysis of health outcomes and risk factors. PASSI raw data are not openly available outside the PASSI network; however, their use for research purposes is encouraged once permission is obtained from the National Coordinating Group at the Italian National Institute of Health (Istituto Superiore di Sanità). No experiments involving human subjects was performed for the PASSI surveillance system and approval for the PASSI was provided by the Italian National Health Institute ethical board. Data collection for PASSI is conducted through the local health units. These units are found in all the 21 Italian regions and comprise the basic structure of the National Italian Health System (a public universal health system). Each region in Italy has between 1 and 22 local health units. The coverage for the data collection are all registered users of the participating local health unit. PASSI data collection began in 2007 and is still ongoing and is conducted by each local health unit participating in the surveillance system, which is over 90% of the Italian local health units. To collect the data, a monthly random sample with replacement is chosen from a list of residents in each local health unit aged 18 - 69 years and a telephone interview is conducted with those selected. The sample selection is stratified based on an age and sex distribution weighting system in order to adjust for the different population sizes of the local health units. Each month a new random sample is selected from the list of residents, therefore the same individuals are not followed over time as in longitudinal studies.
The questionnaire used in the telephone interview covers a wide variety of behavioural and preventive topics, and the same questionnaire is used in all the Italian regions. The wording of the questions has been kept relatively constant over the years except for a few variables not included in the analysis. Any slight change in the wording of the questions has been taken into account before creating the variables required for analysis. For performing the analysis, the monthly data are combined for all the years from 2008 to 2012.
Description of variables used in the analysis for the period between 2008-2012
Variable | Categories | Number of | % |
---|---|---|---|
observations | |||
Smoking status | Smoker | 51696 | 27.9 |
Non-smoker | 133923 | 72.1 | |
Age | 18-29 | 33450 | 18.0 |
30-39 | 38660 | 20.8 | |
40-49 | 44193 | 23.8 | |
50-59 | 35646 | 19.2 | |
60-69 | 33670 | 18.2 | |
Sex | Male | 91160 | 49.1 |
Female | 94459 | 50.9 | |
Marital status | Married | 112853 | 60.8 |
Single | 57959 | 31.2 | |
Widowed or divorced | 14807 | 8.0 | |
Education | University or higher | 24700 | 13.3 |
High school | 82324 | 44.4 | |
Middle school | 58063 | 31.3 | |
Primary school or less | 20532 | 11.0 | |
Income | High | 87989 | 47.4 |
Medium | 74159 | 40.0 | |
Low | 23471 | 12.6 | |
Work status | Works | 107648 | 58.0 |
Does not work | 77971 | 42.0 | |
Region | North | 92741 | 50.0 |
Central | 45142 | 24.3 | |
South | 47736 | 25.7 | |
Physical activity | Active | 61357 | 33.1 |
Partially active | 70800 | 38.1 | |
Sedentary | 53462 | 28.8 | |
Alcohol consumption | High risk drinker | 18852 | 10.2 |
Low risk drinker | 52838 | 28.4 | |
Non-drinker | 113929 | 61.4 | |
Depression status | Not depressed | 173416 | 93.4 |
Depressed | 12203 | 6.6 | |
Year | 2008 | 37205 | 20.0 |
2009 | 38690 | 20.8 | |
2010 | 35896 | 19.2 | |
2011 | 36825 | 19.8 | |
2012 | 37003 | 20.0 |
Methods
For the analysis of the data, a time-varying coefficient model was found using P-spline estimation which requires indicating the degree of the spline (usually a B-spline) as well as the degree of the difference penalty (the difference of the adjacent B-spline coefficients) to perform the computation. Following the recommendations of Eilers and Marx [26], a third degree B-spline is used with a second order difference penalty. This method also requires placing a large number of knots to purposefully overfit the data so that the penalty does the work of regulating the smoothness of the coefficient functions. Below is a brief presentation of the estimation method used followed by the description of the methodology and application used to fit the time-varying coefficient model for smoking status outcome in Italy.
Model and estimation method
where E(ε)=0 and var(ε)=σ^{2} [27]. The covariates are Z=(Z_{1},…,Z_{ p })^{ T } with the constant coefficients b_{ j } and X=(X_{1},…,X_{ p })^{ T } with the varying coefficients a_{ j }. The intercepts are represented by b_{0} and a_{0}, where b_{0} is the constant intercept and a_{0} is the time varying intercept. The effect modifier covariate t represents time; however, the effect modifier can technically be any covariate or multiple covariates and it is the variable that is used to express the coefficients as a function. In this case, the coefficients are a function of time. This model is referred to by Hastie and Tibshirani [27] as a varying coefficient model with a single effect modifier and can also be referred to as a time-varying coefficient model. In BRFS data the time variable represents the time of observation. In longitudinal data, the age of the respondent can also represent the time effect modifier, since the same observation is followed over time.
The VCM can be extended to the non-Gaussian cases, i.e. for a generalized varying coefficient model, through a link function g(μ) as in generalized linear models. In the generalized varying coefficient model, we have a random variable Y with a distribution which depends on a parameter η which is a linear predictor related to the mean μ=E(Y) by η=g(μ) [27]. In the Gaussian case, the normally distributed random variable Y has a mean μ=η=g(μ). Another common model is the case where Y is a binary variable and the model is a logistic model with \(g(\mu)= \log \left \{\frac {\mu }{(1-\mu)}\right \}\). This is an especially common case in the analysis of health data in which the response variables are usually binary variables.
Researchers studying the varying coefficient model have mainly used non-parametric methods with two main estimations techniques: kernel-local polynomial smoothing and estimation using splines (polynomial spline, penalized regression spline or smoothing spline). Kernel-local polynomial estimation was used by Cheng et al. [28] for estimating a varying coefficient model with a binary response variable of infant mortality, and Cai et al. [29,30] applied the technique to Poisson and binary response variables. Smoothing spline estimation was first described by Hastie and Tibshirani [27] and was used by Hoover et al. [7] and Chiang et al. [31] for longitudinal data analysis; polynomial and penalized regression estimation have been discussed by Huang et al. [32] and Marx [33]. The use of parametric methods for estimating varying coefficient models are not favoured for estimation due to the lack of flexibility of these methods as well as the strong assumptions required which can lead to misspecification of the data and large bias [27,34]. The analysis in this paper uses non-parametric techniques with the PASSI surveillance data to find a generalized time-varying coefficient model. P-spline estimation was selected for estimation mainly due to the flexibility it provides the user as discussed herin and its reduced computation time compared to other methods considered (results not shown). However, other spline or kernel methods can also be used.
Splines are merely functions joined together at certain positions (or knots), and the different techniques for estimation using splines depend on the method used to regulate the smoothness of the functions. In polynomial spline estimation, there is no penalty and smoothness is regulated by selecting the ideal number of knots using a certain criteria such as the Akaike information criteria (AIC). This selection of the number of knots can greatly increase computation time as several models need to be fit with different number of knots in order to select the model which provides the minimum AIC.
where n_{ j } are the number of unique values of t (i.e. the number of knots). The B_{ jl }(t) are basis functions for the j^{ t h } variable with parameters γ_{ ij }, and can be polynomial bases, natural cubic splines or B-splines functions [27].
The matrix \(P = block diag(\lambda _{0} \boldsymbol {\Delta }_{d}^{T} \boldsymbol {\Delta }_{d}, \ldots, \lambda _{p} \boldsymbol {\Delta }_{d}^{T} \boldsymbol {\Delta }_{d})\) has a block diagonal structure that breaks the linkage of the penalization from one smooth term to the next. The penalty in P-spline estimation can have different degrees so that the first, second or third difference can be taken. Eilers and Marx [26] recommend using at least a second degree difference penalty with either a quadratic or cubic B-spline basis. Therefore, the P-spline estimation method allows more flexibility as it allows the selection of the degree of the penalty and the spline separately which cannot be done using other spline estimation methods. In addition, the difference penalty described above can be considered simpler and less computationally intensive compared to the smoothing spline penalty shown in (2).
Application: fitting the time VCM
A varying coefficient model is applied to the Italian PASSI data for the period from 2008 to 2012 which provided a sample size of 185,619 individuals. The outcome of interest is smoking status which is a binary response variable to indicate current smokers and non-smokers, where non-smokers includes ex-smokers. Using ten independent variables constructed from the data, the effects of these variables are examined to observe their trends with time and understand the changing characteristics of smokers during this period (see Table 1 for a description of the variables used in the analysis).
For the period of observation used in the analysis (from 2008 to 2012) there are 55 months in which data collection was performed, as July and August were combined in each year in the data collection phase. These 55 months represent the effect modifier of time in the analysis which is constructed from the month and year of the survey data for each observation. They also represent the maximum number of knots that can be used in the estimation, and which is the number of knots used in this application. It is not required or advisable to place the maximum number of knots for very large datasets as this will greatly increase computation times for fitting the models. However, the computation time of the present application allowed for placing the maximum number of knots. This did not affect the results as the presence of the penalty term controls for any over-smoothing.
where \(\text {logit}(SMK) = \log \frac {\Pr (Y = SMK)}{1 - \Pr (Y=SMK)}\) or the log of the odds of being a smoker (SMK). The Z_{ j } are the constant variables from the parametric model with constant coefficients b_{ j }, and a_{1}(t) is the varying coefficient of X_{1}. Note that the variable X_{1} is contained in Z_{ j } with its constant coefficient b_{ j }. This test would show if the varying coefficients a_{1}(·) are actually varying or should remain constant, and it is performed using a likelihood ratio test with a chi-square distribution. This test can be used since there are a limited number of parameters to be estimated (i.e. the number of parameters and not increasing with increasing sample size), and the sample size is large enough to guarantee the asymptotic chi-square distribution of the test statistics. All tests and estimations were performed at the 95% confidence level using R statistical software as described further in the Endnote.
where H_{0} is the null hypothesis for the model which contains the first varying coefficient and H_{1} is the alternative hypothesis for the addition of a second varying coefficient for the variable X_{2}. The variable X_{2} would also have been used in the first test to see if it has coefficients that are actually varying. Therefore, only the variables which were found to have varying coefficients when tested alone would be considered for testing in the stepwise selection process.
Results
Smoking time-varying coefficient model
Fitting the smoking status time-varying coefficient model
Model | Description | Time (min) | p-value | H_{0} used | AIC | df |
---|---|---|---|---|---|---|
of test | for test | |||||
Selection of variables that have varying coefficients | ||||||
LM | logistic model | <1 | - | - | 206590.4 | 21.00 |
Model age | LM + s(t):age | 4.7 | <0.001 | LM | 206554.4 | 28.88 |
Model alcohol | LM + s(t):alcohol use | 2.1 | <0.001 | LM | 206566.0 | 27.39 |
Model physical | LM + s(t):physical activity | 2.1 | <0.001 | LM | 206570.1 | 26.19 |
Model income | LM + s(t):income | 2.2 | <0.001 | LM | 206574.8 | 26.33 |
Model mstatus | LM + s(t):martital status | 2.1 | <0.001 | LM | 206573.8 | 26.50 |
Model edu | LM + s(t):education | 3.1 | <0.001 | LM | 206573.2 | 26.26 |
Model sex | LM + s(t):sex | 1.4 | <0.001 | LM | 206572.4 | 25.37 |
Model work | LM + s(t):work status | 1.4 | <0.001 | LM | 206575.2 | 25.08 |
Model region | LM + s(t):region | 2.2 | <0.001 | LM | 206573.2 | 26.81 |
Model depress | LM + s(t):depression status | 1.4 | <0.001 | LM | 206574.2 | 24.49 |
Model time | LM + s(t) | 1.6 | <0.001 | LM | 206572.7 | 23.72 |
Finding the full varying coefficient model | ||||||
Model I | Model age + s(t):alcohol use | 8.9 | 0.008 | Model age | 206548.7 | 33.21 |
Model II | Model I + s(t):physical activity | 16.8 | 0.070 | Model I | 206547.6 | 36.40 |
Model III | Model II + s(t):income | 25.7 | 0.261 | Model II | 206549.0 | 38.45 |
Model IV | Model II + s(t):marital status | 24.4 | 0.539 | Model II | 206550.8 | 38.72 |
Model V | Model II + s(t):education | 36.5 | 0.227 | Model II | 206549.4 | 39.48 |
Model VI | Model II + s(t):sex | 24.4 | 0.125 | Model II | 206547.3 | 37.34 |
Model VII | Model II + s(t):work status | 21.4 | 0.550 | Model II | 206549.3 | 27.38 |
Model VIII | Model II + s(t):region | 26.5 | 0.470 | Model II | 206550.0 | 38.32 |
Model IX | Model II + s(t):depression status | 22.0 | 0.369 | Model II | 206548.9 | 37.42 |
Model X | Model I + s(t) | 10.4 | 0.006 | Model I | 206548.7 | 33.21 |
Notes: s(t) - spline of time |
where Z_{ j } are the covariates with constant parameters b_{ j }, a(t)=(a_{1}(t),a_{2}(t)) are the time varying coefficients for the variables age and alcohol consumption respectively, and a_{0}(t) is the time varying intercept.
Summary of the smoking status varying coefficient model (Model X)
Variable | OR (95% C.I.) | p-value |
---|---|---|
Age (Reference: 60-59) | ||
18-29 | 2.08 (1.91-2.25) | <0.001 |
30-39 | 1.79 (1.64-1.95) | <0.001 |
40-49 | 1.75 (1.61-1.89) | <0.001 |
50-59 | 1.46 (1.34-1.60) | <0.001 |
s(time):18-29 | - | 0.003 |
s(time):30-39 | - | 0.649 |
s(time):40-49 | - | 0.046 |
s(time):50-59 | - | 0.619 |
s(time):60-69 | - | 0.621 |
Sex (Reference: Female) | ||
Male | 1.61 (1.58-1.64) | <0.001 |
Marital status (Reference: Married) | ||
Single | 1.47 (1.44-1.51) | <0.001 |
Widowed or divorced | 1.84 (1.78-1.90) | <0.001 |
Education (Reference: University | ||
or higher) | ||
High school | 1.36 (1.32-1.39) | <0.001 |
Middle school | 1.81 (1.75-1.87) | <0.001 |
Primary school or less | 1.44 (1.38-1.51) | <0.001 |
Income (Reference: High) | ||
Medium | 1.30 (1.27-1.32) | <0.001 |
Low | 1.78 (1.73-1.83) | <0.001 |
Work status (Rerference: Works) | ||
Does not work | 0.74 (0.73-0.76) | <0.001 |
Region (Reference: North) | ||
Centre | 1.21 (1.19-1.24) | <0.001 |
South | 1.10 (1.07-1.12) | <0.001 |
Physical activity (Reference: Active) | ||
Partially active | 0.95 (0.93-0.97) | <0.001 |
Sedentary | 1.15 (1.12-1.18) | <0.001 |
Alcohol consumption (Reference: | ||
High risk drinker) | ||
Low risk drinker | 0.69 (0.64-0.75) | <0.001 |
Non-drinker | 0.47 (0.43-0.52) | <0.001 |
s(time):High risk drinker | - | 0.353 |
s(time):Low risk drinker | - | 0.232 |
s(time):Do not drink | - | 0.038 |
Depression (Reference: | ||
Not depressed) | ||
Depressed | 1.43 (1.39-1.48) | <0.001 |
Discussion
From this first application to this kind of data, the varying coefficient model, using P-spline estimation, appears to be a useful tool for evolutionary analysis using surveillance data The combination of the availability of big data such as the BRFS and a suitable sophisticated (for the epidemiological field) statistical approach is able to offer unique and important information in the public health field. This combination, at least from our analyses, seems to fulfil the search for evidence required in public health [40], particularly in understanding (and not just showing) the main changes in health outcomes and in health related risk factors. It highlights the major problems in the sub-populations of interest, and shows where and with whom interventions may be effective or may require more attention.
Another interesting aspect for both analysts and readers (epidemiologists, decision makers, etc.) is that this analytical approach returns an easily readable model as an output, showing what is influencing the outcome variable as well as what appears stable or changing over time. In addition, showing using OR plots the kind of change that has happened over time, without any linearity constraints, or any need to be monotonic. From a computational point of view, adopting the procedures proposed herein seems to be not too time consuming and quite straightforward, therefore showing promise for use in big data applications.
For the purpose of completeness and to further show how informative the approach can be, we will now comment on the specific results of the case examined in this application. Recent years have seen in Italy, as in many European countries, a steady and slow decline in the prevalence of smoking [41]. When, how, and where this decline is more evident are the fundamental questions to answer for evaluating past health promotion and public health action, and to drive future policies and interventions. Behavioural risk factor surveillance systems can offer abundant data to start to answer these and similar questions. In the case study presented, using the Italian surveillance PASSI, we have seen many potentially influential variables, such as gender, income or education, which appear to affect smoking prevalence in a stable manner over time, since these variables were found to have constant coefficients. Meaning that, for instance, actions aimed to reduce inequalities among these population strata in the five years of observation have not been successful, having the most deprived strata of the population odd ratios,all the other variables kept constant, significantly higher than those higher educated, higher income, etc.
On the other hand and as seen in the OR plots, some odds ratios have changed in recent years. For instance, the youngest age category and those who do not drink are smoking less compared to their respective reference categories. The preliminary results from the analysis is encouraging: in this five year period the younger generations seem to be smoking slightly less (decreases are significant, but still the ORs remain greater than one). This result is probably a combined effect of several interventions carried out in Italy in recent years, particularly targeted towards young people and school children. However, the analysis has shown that there is still much needed to be done to decrease smoking in other sub-populations. The small decrease of smoking prevalence observed in these years can be attributed to the fact that except for the youngest age group and non-drinker, those at higher risk do not appear to have changed their smoking behaviour. The poor, less educated, sedentary etc. present a prevalence of smoking higher than the others and these differences are constant over the years. What is worth noting is that the clustering of bad behaviours seems to be reinforced: those with a more risky drinking behaviour present an increasing prevalence of smokers (or, as shown in the graphs, those with less risky drinking behaviour present a decreasing prevalence of smoking). This could be an alarm bell for those working in the public health field, also showing possible tracks for interventions. If targeted interventions have been rather successful for some sub-populations, such as the young, more targeted interventions are needed also for other sub-populations that typically are more difficult to reach by general population interventions. Among the limitations of surveillance data, is the absence of a longitudinal component therefore the analysis is only able to show associations and not causal processes. Nevertheless, an analysis for sub-populations can be very informative as shown in this case. These results, obviously relevant for any policy analysis, have been possible through the proposed analysis although the time span of observation was only five years: it is possible that longer periods of observation could offer even more interesting results. In addition, it is worth noting that applying this method to health outcomes that need more time to change (both physically and behaviourally), such as some chronic diseases, may perhaps need a longer surveillance period to be able to observe interesting trends in the time varying coefficients.
Conclusion
All these findings must be considered as preliminary results, since they are coming from a first application of this method to a single data system. Further analysis and application to other surveillance systems could provide further insight on how much this approach could result in a fundamental tool for a dynamic analysis of surveillance data. In addition, the present application does not include any interaction terms that may further affect the final results. Any significant interaction terms can be added beginning with a parametric logistic model before adding the varying coefficient terms. This was not performed in this case as the main purpose here was to demonstrate the usefulness of the method to BRFS data and not to present the most suitable model for smoking status in Italy.
Considering the limitations of this study, we can conclude that the application of the VCM techniques to BRFS data allows for the study of the changing effects of possible determinants on a health outcome/risk factor in order to better inform policy interventions.
As for the specific case studied, PASSI surveillance data analysed over time show how much there is still to do in order to produce a more relevant decrease of smoking prevalence in Italy. Some good signals are detected in the slow decrease in the OR of smoking among the youngest age category. However, no other sign of significant change has been observed. Major differences among population subgroups still exist indicating potential health inequalities and worrisome clustering of smoking behaviour with other negative behaviours such as risky alcohol drinking and sedentary behaviours.
Endnote
^{a} Varying coefficient models can be estimated using the mgcv package in R software [42,43] and using the gam function that is used for fitting generalized additive models. However, since the data in the presented application is relatively large, to save computation time, the bam function of the package is used which works like the gam function but is designed for large datasets [43]. This can be especially useful in surveillance data analysis as longer periods of observation indicate very large and increasing sample size. When compared to the gam function, the bam function can take minutes to fit the most complicated model compared to several hours and even days depending on the sample size. For this function to perform even faster, the method used for selection of the smoothing parameter λ is by the fast REML computation method instead of the generalized cross validation method usually used by gam function. To use this function for estimating a varying coefficient model the “by” option is used as shown in the following example:
bam(SMK ∼Zj + INC + s(time, bs = ~ps~, k=55, m=c(3,2), by = INC), family = binomial(~logit~)),
where SMK is the response variable for smoking status, INC is the independent variable for income status, Z_{ j } are all the other independent variables with constant coefficients, ps is for P-spline estimation, k=55 is the number of knots, and m=c(3,2) indicates the use of the third degree B-spline bases with a second order difference penalty.
For plots produced by the plot.gam function of the mgcv package, Bayesian confidence intervals are used for plotting of the smooth terms, which can be obtained by simulating from the posterior distribution of the functional coefficients (or varying coefficients) [39] For model selection, esting between nested models was performed using anova(model 1,model 2, test="Chisq") [39]. In addition the AIC of the models were found using the AIC function in R.
Declarations
Acknowledgements
The authors acknowledge the work of all the PASSI network in Italy. Without their precious effort no information would have been available for the analysis performed in this paper. They would also like to thank the reviewers for their helpful comments as well as Clara Burgert for her edits, all of which contributed to improving the paper.
Authors’ Affiliations
References
- Lee L, Thacker S. Principles and Practice of Public Health Surveillance. USA: Oxford University Press; 2010.View ArticleGoogle Scholar
- Campostrini S, McQueen D, Taylor A, Daly A. World alliance for risk factor surveillance white paper on surveillance and health promotion. Aims Public Health. 2015; 2:10–26.View ArticleGoogle Scholar
- Mokdad A. The behavioral risk factors surveillance system: Past, present, and future. Annu Rev Public Health. 2009; 30:43–54.View ArticlePubMedGoogle Scholar
- Campostrini S. Surveillance systems and data analysis: continuously collected behavioural data In: McQueen D, Puska P, editors. Global Behavioral Risk Factor Surveillance. New York: Springer: 2003. p. 47–56.Google Scholar
- Campostrini S. Surveillance for NCDs and Health promotion: an issue of theory and method In: McQueen D, editor. Global Handbook on Non-communicable Diseases and Health Promotion. New York: Springer: 2013.Google Scholar
- Huang J, Wu C, Zhou L. Polynomial spline estimation and inference for varying coefficient models with longitudinal data. Statistica Sinica. 2004; 14(3):763–88.Google Scholar
- Hoover D, Rice J, Wu C, Yang L. Nonparametric smoothing estimates of time-varying coefficient models with longitudinal data. Biometrika. 1998; 85(4):809–22.View ArticleGoogle Scholar
- Wang L, Li H, Huang J. Variable selection in nonparametric varying-coefficient models for analysis of repeated measurements. J Am Stat Assoc. 2008; 103(484):1556–69.View ArticlePubMedPubMed CentralGoogle Scholar
- Wu C, Yu K, Chiang C. A two-step smoothing method for varying-coefficient models with repeated measurements. Ann Inst Stat Mathematics. 2000; 52(3):519–43.View ArticleGoogle Scholar
- Wu C, Chiang C. Kernel smoothing on varying coefficient models with longitudinal dependent variable. Statistica Sinica. 2000; 10(2):433–56.Google Scholar
- Zhang D. Generalized linear mixed models with varying coefficients for longitudinal data. Biometrics. 2004; 60(1):8–15.View ArticlePubMedGoogle Scholar
- Young L, Gotway C, Yang J, Kearney G, DuClos C. Assessing the association between environmental impacts and health outcomes: A case study from florida. Stat Med. 2008; 27(20):3998–4015.View ArticlePubMedGoogle Scholar
- PASSI - Coordinating technical group of the behavioural risk factor system. PASSI (Progressi Delle Aziende Sanitarie per la Salute in Italia) - The Italian behavioral risk factor surveillance system. 2013. Available at: http://www.epicentro.iss.it/passi/en/english.asp. Rome. Retrieved on: 27 June 2013.
- Baldissera S, Campostrini S, Binkin N, Minardi V, Minelli G, Ferrante G, et al. Features and initial assessment of the Italian Behavioral Risk Factor Surveillance System (PASSI), 2007-2008. Preventing Chronic Disease. 2011; 8(1):A24. http://www.cdc.gov/pcd/issues/2011/jan/10_0030.htm.PubMedGoogle Scholar
- Minardi V, Campostrini S, Carrozzi G, Minelli G, Salmaso S. Social determinants effects from the italian risk factor surveillance system passi. Int J Public Health. 2011; 56(4):359–66.View ArticlePubMedGoogle Scholar
- Binkin N, Gigantesco A, Ferrante G, Baldissera S. Depressive symptoms among adults 18–69 years in italy: results from the italian behavioural risk factor surveillance system, 2007. Int J Public Health. 2010; 55(5):479–88.View ArticlePubMedGoogle Scholar
- Chandola T, Head J, Bartley M. Socio-demographic predictors of quitting smoking: how important are household factors?. Addiction. 2004; 99(6):770–7.View ArticlePubMedGoogle Scholar
- Hiscock R, Bauld L, Amos A, Fidler J, Munafò M. Socioeconomic status and smoking: a review. Ann New York Acad Sci. 2012; 1248(1):107–23.View ArticleGoogle Scholar
- Levy D, Romano E, Mumford E. The relationship of smoking cessation to sociodemographic characteristics, smoking intensity, and tobacco control policies. Nicotine Tobacco Res. 2005; 7(3):387–96.View ArticleGoogle Scholar
- Siahpush M, Borland R. Socio-demographic variations in smoking status among australians aged ≥18: multivariate results from the 1995 national health survey. Aust New Zealand J Public Health. 2001; 25(5):438–42.View ArticleGoogle Scholar
- Tramacere I, Gallus S, Zuccaro P, Colombo P, Rossi S, Boffetta P, et al. Socio-demographic variation in smoking habits: Italy, 2008. Preventive Med. 2009; 48(3):213–7.View ArticleGoogle Scholar
- Cook J, Fucito L, Piasecki T, Piper M, Schlam T, Berg K, et al. Relations of alcohol consumption with smoking cessation milestones and tobacco dependence. J Consulting Clin Psychology. 2012; 80(6):1075.View ArticleGoogle Scholar
- Boden J, Fergusson D, Horwood L. Cigarette smoking and depression: tests of causal linkages using a longitudinal birth cohort. Br J Psychiatry. 2010; 196(6):440–6.View ArticlePubMedGoogle Scholar
- Pratt L, Brody D. Depression and smoking in the US household population aged 20 and over, 2005-2008. NCHS Data Brief. 2010; 34:1.Google Scholar
- Trost S, Owen N, Bauman A, Sallis J, Brown W. Correlates of adults’ participation in physical activity: review and update. Med Sci Sports Exercise. 2002; 34(12):1996–2001.View ArticleGoogle Scholar
- Eilers P, Marx B. Generalized linear additive smooth structures. J Comput Graphical Stat. 2002; 11(4):758–83.View ArticleGoogle Scholar
- Hastie T, Tibshirani R. Varying-coefficient models. J R Stat Soc. Ser B (Methodological). 1993; 55(5):757–96.Google Scholar
- Cheng M, Zhang W, Chen L. Statistical estimation in generalized multiparameter likelihood models. J Am Stat Assoc. 2009; 104(487):1179–91.View ArticleGoogle Scholar
- Cai Z, Fan J, Li R. Generalized Varying-coefficient models. UC Los Angeles: Department of Statistics, UCLA. 1999. Available at: http://escholarship.org/uc/item/7n0494s4, Department of Statistics, UCLA, Los Angeles.
- Cai Z, Fan J, Li R. Efficient estimation and inferences for varying-coefficient models. J Am Stat Assoc. 2000; 95(451):888–902.View ArticleGoogle Scholar
- Chiang C, Rice J, Wu C. Smoothing spline estimation for varying coefficient models with repeatedly measured dependent variables. J Am Stat Assoc. 2001; 96(454):605–19.View ArticleGoogle Scholar
- Huang J, Wu C, Zhou L. Varying-coefficient models and basis function approximations for the analysis of repeated measurements. Biometrika. 2002; 89(1):111–28.View ArticleGoogle Scholar
- Marx B. P-spline varying coefficient models for complex data In: Kneib T, Tutz G, editors. Statistical Modelling and Regression Structures. Berlin Heidelberg: Springer: 2010. p. 19–43.Google Scholar
- Fan J, Zhang W. Statistical methods with varying coefficient models. Stat Its Interface. 2008; 1(1):179–95.View ArticleGoogle Scholar
- Eilers P, Marx B. Flexible smoothing with b-splines and penalties. Stat Sci. 1996; 11(2):89–102.View ArticleGoogle Scholar
- Wasserman L. All of Nonparametric Statistics. New York: Springer; 2006.Google Scholar
- Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning, 2nd. New York: Springer; 2009.View ArticleGoogle Scholar
- Wood S. Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. J R Stat Soc: Ser B (Stat Methodology). 2011; 73(1):3–36.View ArticleGoogle Scholar
- Wood S. Generalized Additive Models: an Introduction with R. Boca Raton, FL: Chapman & Hall; 2006.Google Scholar
- Campostrini S. Measurement and effectiveness: methodological considerations, issues and possible solutions. In: Global Perspectives on Health Promotion Effectiveness. New York: Springer: 2007. p. 305–25.Google Scholar
- La Vecchia C, Bosetti C, Lucchini F, Bertuccio P, Negri E, Boyle P, et al. Cancer mortality in europe, 2000–2004, and an overview of trends since 1975. Ann Oncol. 2010; 21(6):1323–60.View ArticlePubMedGoogle Scholar
- R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2012. R Foundation for Statistical Computing. ISBN 3-900051-07-0. http://www.R-project.org/.Google Scholar
- Wood S. Mgcv: Mixed GAM Computation Vehicle with GCV/AIC/REML Smoothness Estimation. 2007. R package version 1.7-22.Google Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.