Descriptive statistics for rates
Suppose that we want to assess the burden of a disease in a population of size N during a given period of time. If X cases of the disease have been observed, the crude rate (CR) is defined as X/N. The CR is the simplest and most straightforward summary measure of the disease burden in the population. However, the occurrence of events may be strongly related to age, so that the age-specific rates differ greatly from one another, and it is therefore of interest to calculate them. The use of a world standard population [12] and direct standardization (or any other adjustment procedure) seeks to provide numbers and comparisons that minimize the influence of age and/or other extraneous factors through age-standardized rates (ASR) [13]–[15]. These ASRs can also be truncated to the age groups of interest. In cancer, the calculation of truncated rates (TR) over the age range 35–64 [14] has been proposed, mainly because of doubts about the accuracy of age-specific rates in the elderly, for whom diagnosis and recording of cancer may be much less certain. Finally, another useful summary measure of disease frequency is the cumulative rate (CumR) [14, 15], which is the sum of the age-specific incidence rates, taken from birth to age 74, in a certain time period. The CumR provides an estimate of the cumulative risk (Cumulative Risk = 1 − exp[−CumR]), which is the risk that an individual would have of developing an event of interest during a certain age span if no other causes of death were in operation [14].
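The summary measures described above can be sketched in a few lines of Python. This is a minimal illustration and not the SART implementation: the function names, the three age groups and the standard weights are hypothetical, and the cumulative rate is multiplied by the width of the age groups (5 years is assumed here), as is usual when rates are tabulated by 5-year age group.

```python
import math

def age_specific_rates(cases, person_years):
    """Rate in each age group: cases divided by person-years at risk."""
    return [c / y for c, y in zip(cases, person_years)]

def crude_rate(cases, person_years):
    """Crude rate: all cases divided by all person-years."""
    return sum(cases) / sum(person_years)

def direct_asr(cases, person_years, std_weights):
    """Directly standardized rate: weighted mean of the age-specific
    rates, with weights taken from a standard population."""
    rates = age_specific_rates(cases, person_years)
    return sum(r * w for r, w in zip(rates, std_weights)) / sum(std_weights)

def cumulative_rate(cases, person_years, width=5):
    """Sum of age-specific rates times the age-group width (assumed 5 years)."""
    return width * sum(age_specific_rates(cases, person_years))

def cumulative_risk(cum_rate):
    """Cumulative risk = 1 - exp(-CumR)."""
    return 1 - math.exp(-cum_rate)

# Hypothetical data for three age groups (not taken from the paper).
cases = [10, 20, 30]
person_years = [1000, 1000, 1000]
std_weights = [50, 30, 20]

cr = crude_rate(cases, person_years)
asr = direct_asr(cases, person_years, std_weights)
# Truncated rate: the same standardization restricted to the last two groups.
tr = direct_asr(cases[1:], person_years[1:], std_weights[1:])
cum_r = cumulative_rate(cases, person_years)
risk = cumulative_risk(cum_r)
```

In practice the rates are usually multiplied by a constant such as 100,000 for reporting; that scaling is omitted here to keep the arithmetic visible.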
Estimating the annual percent change in rates (EAPC)
In descriptive epidemiology, the evolution of the incidence or mortality rates of a certain disease during a given time period can generate etiological hypotheses. The estimated annual percent change (EAPC) is one way to characterize trends in disease rates over time. It assumes that the rates change at a constant percentage of the rate of the previous year [15]. Let us assume that we want to assess the EAPC of the ASRs during a period of time (in cancer, usually years). Let $\mathrm{ASR}_T$ be the ASR for the $T$th year; the time trend of the ASRs can be modelled through a Gaussian log-linear model,
$$\ln(\mathrm{ASR}_T) = \alpha + \beta\,T + \varepsilon_T \qquad (1)$$
where $\varepsilon_T$ is a Gaussian error term and $\mathrm{EAPC} = (e^{\beta} - 1)\cdot 100$. The 95% confidence intervals of the EAPC can be easily derived through the standard errors of model (1) [15].
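As a rough sketch of how the EAPC and its confidence interval follow from model (1), the code below fits the log of the ASRs on calendar year by ordinary least squares and back-transforms the slope. The function name `eapc` and the default critical value of 1.96 (a normal approximation) are assumptions made for illustration; with short series, a Student t quantile with n − 2 degrees of freedom would typically be used instead.

```python
import math

def eapc(years, asrs, crit=1.96):
    """Return (EAPC, lower, upper) in percent from a least-squares fit of
    ln(ASR) on year. `crit` is the critical value for the interval
    (1.96 is a normal approximation, assumed here)."""
    n = len(years)
    log_asr = [math.log(a) for a in asrs]
    x_mean = sum(years) / n
    y_mean = sum(log_asr) / n
    sxx = sum((x - x_mean) ** 2 for x in years)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(years, log_asr))
    beta = sxy / sxx  # slope of the log-linear trend
    # Residual sum of squares, computed on centred data for numerical stability.
    rss = sum((y - y_mean - beta * (x - x_mean)) ** 2
              for x, y in zip(years, log_asr))
    se_beta = math.sqrt(rss / (n - 2) / sxx)

    def to_percent(b):
        return (math.exp(b) - 1) * 100

    return (to_percent(beta),
            to_percent(beta - crit * se_beta),
            to_percent(beta + crit * se_beta))

# Hypothetical series growing by exactly 2% per year.
years = list(range(2000, 2006))
asrs = [100 * 1.02 ** k for k in range(6)]
estimate, lower, upper = eapc(years, asrs)
```

Because the interval is built on the log scale and then back-transformed, it is not symmetric around the EAPC.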
Predicting the Expected number of incident (or death) disease cases by age group using the time trends of rates
Prediction of future disease burden is essential for effective health service planning, as it may be utilized by public health authorities to formulate prevention, diagnosis and treatment strategies [16]. Simple log-linear models can be used to make these predictions [17]. If we assume that the time trend of the rates $C_{iT}/Y_{iT}$, where $C_{iT}$ is the number of cases for the $i$th age group in period $T$ and $Y_{iT}$ are the corresponding person-years at risk, is linear on the log scale, the following log-linear model can be fitted to these rates:
$$\ln\left(\frac{C_{iT}}{Y_{iT}}\right) = \alpha_i + \beta_i\,(T - T_0) \qquad (2)$$
where $T_0$ is the reference time, $\alpha_i$ is the log-rate at $T_0$ for the $i$th age group and $\beta_i$ is the age-specific slope. A more parsimonious version of this model assumes a common slope for all age groups,
$$\ln\left(\frac{C_{iT}}{Y_{iT}}\right) = \alpha_i + \beta\,(T - T_0) \qquad (3)$$
and is known as the age-drift model. For these models, $C_{iT}$ is assumed to follow a Poisson distribution [17]. However, the negative binomial distribution has also been used as an alternative to the Poisson when there is evidence of “overdispersion” (higher variance than expected) in the data [18].
Prediction of incidence at a future time $F$ can be made using the fitted model (2) or (3), replacing $T$ by $F$ and $Y_{iT}$ by $Y_{iF}$. Both the Poisson and the negative binomial distributions are considered for each of models (2) and (3), so four models are assessed in order to select the one that best fits the data. The assessment is made through Akaike’s Information Criterion (AIC) [19] and the Chi-square test [17].
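The prediction step can be illustrated once the coefficients of model (2) or (3) have been estimated. The sketch below does not fit the models; it assumes that the intercepts and slopes are already available (the values shown are hypothetical) and only shows how the expected cases at a future time and the AIC of a Poisson fit would be obtained. The function names are illustrative and are not part of the Expected application.

```python
import math

def predict_cases(alpha, beta, person_years_future, future_time, ref_time):
    """Expected cases per age group at a future time F:
    Y_iF * exp(alpha_i + beta_i * (F - T0)).
    For the age-drift model, pass the same slope for every age group."""
    return [y * math.exp(a + b * (future_time - ref_time))
            for a, b, y in zip(alpha, beta, person_years_future)]

def poisson_aic(observed, fitted, n_params):
    """AIC = 2k - 2 log L for independent Poisson counts."""
    log_lik = sum(c * math.log(m) - m - math.lgamma(c + 1)
                  for c, m in zip(observed, fitted))
    return 2 * n_params - 2 * log_lik

# Hypothetical fitted coefficients for two age groups (not from the paper).
alpha = [math.log(0.001), math.log(0.002)]  # log-rates at the reference time
beta = [0.0, math.log(1.1) / 5]             # age-specific slopes per year
expected = predict_cases(alpha, beta, [1000, 1000],
                         future_time=2010, ref_time=2005)
```

Among candidate models fitted to the same data, the one with the lowest AIC would be preferred; a negative binomial fit would need its own log-likelihood, which is not shown here.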
Comparing risk between two groups (time periods or geographical areas): standardized incidence or mortality ratio (SIMR)
The SIMR is used to determine whether the occurrence of a disease in a target population is higher or lower than in a reference population. For example, we can compare the incidence of cancer either in the same area in two different time periods or in two different areas in the same time period. The SIMR can be calculated as
$$\mathrm{SIMR} = \frac{D}{E} \qquad (4)$$
where $D$ is the number of observed events in the target population and $E$ is the number of expected events in this population, obtained using the incidence (or mortality) rates of the reference population [15].
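The ratio of observed to expected events described above can be sketched as follows. The expected count is obtained by applying the age-specific rates of the reference population to the person-years of the target population; the function names and the numbers are hypothetical.

```python
def expected_events(reference_rates, target_person_years):
    """Events expected in the target population if it experienced
    the age-specific rates of the reference population."""
    return sum(r * y for r, y in zip(reference_rates, target_person_years))

def simr(observed_events, reference_rates, target_person_years):
    """Standardized incidence or mortality ratio: observed / expected."""
    return observed_events / expected_events(reference_rates,
                                             target_person_years)

# Hypothetical example with two age groups.
reference_rates = [0.01, 0.02]
target_person_years = [1000, 2000]
ratio = simr(60, reference_rates, target_person_years)
```

A ratio above 1 indicates more events in the target population than expected under the reference rates, and a ratio below 1 indicates fewer.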
Assessment of differences due to risk and demographic factors when comparing disease rates of two populations
To assess differences in the number of incident cases or deaths between two time points or two areas, and to clarify the role of changes in demographic factors and in the risk of developing or dying from a disease, we used the method of Bashir and Estève [20]. For example, we can compare the observed age-specific incident cases of a certain cancer in the period 1995–1999, $C_{i1}$, with the observed age-specific incident cases in the period 2000–2004, $C_{i2}$. Assuming eighteen 5-year age groups, the observed percent net change in the total number of cases between both periods can be calculated as
$$\mathrm{Net}(\%) = \frac{\sum_{i=1}^{18} C_{i2} - \sum_{i=1}^{18} C_{i1}}{\sum_{i=1}^{18} C_{i1}} \cdot 100 \qquad (5)$$
Net(%) can be separated into two components: i) changes in the size and age distribution (structure) of the population and ii) changes in the risk of developing the disease,
$$\mathrm{Net}(\%) = \mathrm{Population}(\%) + \mathrm{Risk}(\%) \qquad (6)$$
Within each age group, rates are assumed to be constant throughout the period 1995–1999 and throughout the period 2000–2004. Under constant rates and an unchanged age structure, if the population size increases by 10%, the number of incident cases will also increase by 10%. The effect of population structure is estimated by comparing the rate observed in 1995–1999 with the rate expected in 2000–2004, obtained by applying the age-specific rates observed in 1995–1999 to the population pyramid of 2000–2004. Lastly, the percent change not explained by changes in the population is considered to be due to variation in the risk of developing the disease [20]. The net change can also be calculated for the CR [20]. Mathematical details of equation (6) can be found in Additional file 1.
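The decomposition described above can be sketched as follows. This is one common way of writing it, under the assumption that the expected cases in the second period are obtained by applying the first-period age-specific rates to the second-period population; the exact formulas are those of Additional file 1 and may differ in detail. The function name, dictionary keys and data are hypothetical.

```python
def decompose_net_change(cases1, pop1, cases2, pop2):
    """Split the percent net change in total cases between two periods
    into population (size + structure) and risk components.
    All percentages are relative to the total cases of period 1."""
    total1 = sum(cases1)
    total2 = sum(cases2)
    rates1 = [c / n for c, n in zip(cases1, pop1)]
    # Cases expected in period 2 under the age-specific rates of period 1.
    expected2 = sum(r * n for r, n in zip(rates1, pop2))

    net = (total2 - total1) / total1 * 100
    population = (expected2 - total1) / total1 * 100
    risk = (total2 - expected2) / total1 * 100
    # Size: change expected if only the total population had changed.
    size = (sum(pop2) / sum(pop1) - 1) * 100
    structure = population - size
    return {"net": net, "population": population, "risk": risk,
            "size": size, "structure": structure}

# Hypothetical data for two age groups and two periods.
result = decompose_net_change(cases1=[10, 20], pop1=[1000, 1000],
                              cases2=[15, 30], pop2=[1200, 1000])
```

In this toy example the population grows by 10%, but the growth is concentrated in the low-risk age group, so the structure component is negative and most of the 50% net increase is attributed to risk.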
Assessing survival of a cohort of patients diagnosed with a certain disease
The observed survival (OS) rate is the basic measure of the survival experience of a group of patients from the date of diagnosis to a certain time. However, information on the cancer patients’ causes of death might not always be suitable, or it might be vague or unavailable [15, 21]. Since the interest lies in describing mortality attributable to the disease under study, one method of estimating net survival, where the disease of interest is assumed to be the only possible cause of death [21], is relative survival (RS). It is interpreted as the probability of surviving after diagnosis if the disease of interest were the only possible cause of death. For a cohort of patients diagnosed with a certain disease, say cancer, the cumulative RS at time $T$ is defined as
$$RS(T) = \frac{S_O(T)}{S_E(T)} \qquad (7)$$
where $S_O(T)$ is the observed survival rate in the study cohort and $S_E(T)$ is the expected survival of that cohort. The latter is estimated from life tables of a comparable general population, stratified by age, sex and calendar time, under the assumption that cancer deaths are a negligible proportion of all deaths [21]. RS(T) can be calculated by estimating $S_O(T)$ with the Kaplan–Meier method and $S_E(T)$ with the Hakulinen method [22]. The 95% confidence intervals of RS(T) can be estimated through the standard errors of the log-transformation of $S_O(T)$, treating $S_E(T)$ as a constant [23]. Some interpretations of RS are not straightforward. Note that improvements in the general mortality of the reference population affect $S_E(T)$ in Equation (7) [24]. Suppose we want to compare the 5-year RS of lung cancer between the periods 1990–1994 (RS(5) = 10.5%) and 1995–1999 (RS(5) = 8.5%) among men in Catalonia (Spain) [25]. Although cancer mortality decreased in 2000–2004 compared to 1995–1999 in Catalonia [26], we observed a decrease in the 5-year RS of lung cancer. This may suggest that RS(5) was worsening in 1995–1999, but the explanation is that $S_E(5)$ increased between the two periods while $S_O(5)$ remained stable, and therefore RS(5) decreased [26]. Accordingly, comparisons of RS between two periods for cancers with poor survival should be interpreted with caution [24].
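The definition of RS above can be illustrated with a small sketch that estimates the observed survival by the Kaplan–Meier method and divides it by an expected survival supplied by the user. The Hakulinen estimate of the expected survival is not implemented here and is treated as a known constant, which is an assumption of this example. The interval is built on the log scale of the observed survival using Greenwood's formula, one plausible reading of the approach cited in the text; function names and data are hypothetical, and this is not the WAERS code.

```python
import math

def kaplan_meier(times, events, horizon):
    """Kaplan-Meier survival at `horizon` and the standard error of its log
    (Greenwood's formula). `events` is 1 for death, 0 for censoring."""
    death_times = sorted({t for t, e in zip(times, events)
                          if e and t <= horizon})
    survival = 1.0
    var_log = 0.0
    for t in death_times:
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if e and u == t)
        survival *= 1 - deaths / at_risk
        # The log-scale variance is undefined once survival reaches zero,
        # so that term is skipped.
        if at_risk > deaths:
            var_log += deaths / (at_risk * (at_risk - deaths))
    return survival, math.sqrt(var_log)

def relative_survival(times, events, expected_survival, horizon, crit=1.96):
    """Return (RS, lower, upper), treating the expected survival as constant."""
    observed, se_log = kaplan_meier(times, events, horizon)
    rs = observed / expected_survival
    lower = observed * math.exp(-crit * se_log) / expected_survival
    upper = observed * math.exp(crit * se_log) / expected_survival
    return rs, lower, upper

# Hypothetical follow-up times in years and vital status (1 = dead).
times = [1, 2, 3, 4, 6, 6]
events = [1, 0, 1, 0, 0, 0]
rs, lower, upper = relative_survival(times, events,
                                     expected_survival=0.9, horizon=5)
```

Since the expected survival sits in the denominator, a higher expected survival with an unchanged observed survival yields a lower RS, which is the mechanism behind the lung cancer example in the text.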
The set of web-based applications included in REGSTATTOOLS
REGSTATTOOLS (http://regstattools.net/) includes a set of web-based statistical applications running on a web server under the Linux operating system: SART (Statistical Analysis of Rates and Trends) [27], RiskDiff (a web tool for the analysis of the difference due to risk and demographic factors for incidence or mortality data) [28] and WAERS (Web-Assisted Estimation of Relative Survival) [29]. The web pages of all these applications were implemented using the server-side scripting language PHP and HTML [30], whereas the statistical computation was implemented using the R statistical software [31].
Figure 1 shows an overview of the REGSTATTOOLS applications. Each application requires at least one input file to perform the corresponding statistical analysis. The user must have an Individual Records data file available, which is the base file for all the statistical analyses, since most of the data files used in the web applications are derived from it. This base file must contain the following patient variables: patient identification, sex, type of disease, age, month and year at diagnosis, status of the patient (dead or alive), month and year of death and, finally, the follow-up time in years. The SART applications require a total of 6 files to be uploaded, whereas the RiskDiff and WAERS applications require only one file each.
The SART applications [27] require an Aggregated Data file that must contain the following columns: sex, age group, incidence or mortality year, type of disease, cases and person-years at risk. To perform a descriptive analysis of the disease rates, the user can use the Descriptive application after preparing an age-groups file and a standard population file. The time trend analysis of rates can be performed using the EAPC application, which also requires a standard population file. The Expected application predicts the expected number of cases in a future period or in another geographical area, using the aggregated data and a file containing an external population distribution (person-years at risk by age group). The comparison of risk between two groups can be performed with the SIMR application, which requires the Aggregated Data file to be partitioned into two files, each with the data of the corresponding time period. Alternatively, these data can be compared with data from another area in the same time period; in either case, two files are required. In total, the user must prepare 6 files to fully run SART.
The RiskDiff application [28] has been developed to perform the analysis described in the section Assessment of differences due to risk and demographic factors when comparing disease rates of two populations. It requires information on the number of cases and person-years at risk by age group in each of the two periods or two geographical areas to be compared. Accordingly, the Aggregated Data Selection file must contain 2 columns for each period or area compared: one with the person-years and another with the number of cases.
Finally, the RS is obtained through the WAERS application [29], which requires a Selection of Individual Records file with the following variables: patient ID, age and year at the beginning of the study, sex, years of follow-up and vital status (dead or alive).
Throughout the paper, AF refers to the additional files, where further figures and tables can be found, including those related to the example section.