Pandemics are major disruptions that demand complex, multi-dimensional analytics to understand the diverse drivers of morbidity and mortality at the population scale. Using a more comprehensive set of variables than previous research, our analysis identified the subset of variables that are significant across multiple timeframes. The key findings identified community-level population characteristics, access to care/health resources, vulnerable populations, COVID-19 related behaviors and policies, resource deprivation, and traveling behavior as important SDoH categories persistently associated with COVID-19 outcomes. Further, our diagnostic testing showed that the models for cumulative and 14-day maximum death counts fit best at the 30-, 60-, and 90-day timeframes.
Several findings are consistent with previous research [11, 43, 44]. Vulnerable populations (a higher percentage of black, over-65, and foreign-born residents) are at disproportionately increased risk [45,46,47]. Percent-black and percent-foreign-born were associated with cumulative case and death counts as well as maximum 14-day rolling averages for cases and deaths. Percent-over-65 was associated with deaths but not cases. The effect of these variables on outcomes decreased as the pandemic progressed, a pattern also seen in other studies [48]. This highlights the role of structural racism in the pandemic. The vulnerable foreign-born and black populations bore the brunt of the pandemic in its earlier phase. A lack of resources to “weather the storm” may have been responsible [49]. Many members of vulnerable populations are engaged in front-line jobs that require in-person presence at work. This would hinder their ability to shelter in place effectively, so they would lose the protection that sheltering in place was found to offer for all outcomes. As the pandemic progressed, the effects of these vulnerable-population variables on the outcomes decreased. This makes it likely that the effect seen at the onset of the pandemic was due not to some genetic characteristic inherent in these populations, but to the socioeconomic disparities they face, which made them more susceptible to COVID-19 infection and less able to recover from it. Prior work has also shown that racial characteristics do not lead to poorer outcomes once hospitalized for COVID-19 [6]. The over-65 population was not more susceptible to COVID-19 infection but had higher mortality once infected. This observation provides further evidence that socioeconomic disparities were responsible for the disparate impact of the pandemic on vulnerable populations.
Additionally, percent-public-assistance and percent-walk were identified as protective factors against mortality in neighborhood-level analysis [50]. It is notable that counties with high rates of public assistance use had lower rates of COVID-19, because high public assistance use indicates that a county has a high rate of impoverished residents. One potential explanation is that those using governmental programs could avoid environments that put them at risk and received better education on pandemic mitigation strategies.
Our figures show a consistently or increasingly negative regression coefficient for the days-since-pandemic-start variable across different time frames. Even when accounting for SDoH and population characteristics, counties with more days between the pandemic start date and the date they met the threshold for COVID-19 being present had lower death and case counts as well as lower 14-day maximum death and case averages.
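The construction of a days-since-pandemic-start variable can be illustrated with a short sketch. The Python below is a minimal illustration and not the code used in this study: the function name, the input layout (a list of daily new cases beginning on a county's first reporting date), the example dates, and the default threshold of 10 cumulative cases per 10,000 residents are all assumptions made for the example.

```python
from datetime import date, timedelta


def days_to_threshold(daily_cases, population, pandemic_start,
                      first_report_date, threshold_per_10k=10.0):
    """Days from pandemic_start until cumulative cases per 10,000 residents
    first reach threshold_per_10k; None if the threshold is never reached.

    All argument names and the default threshold are hypothetical.
    """
    cumulative = 0
    for offset, new_cases in enumerate(daily_cases):
        cumulative += new_cases
        # Compare cross-multiplied values to avoid floating-point division.
        if cumulative * 10_000 >= threshold_per_10k * population:
            threshold_date = first_report_date + timedelta(days=offset)
            return (threshold_date - pandemic_start).days
    return None


# Hypothetical county of 50,000 residents: the threshold is 50 cumulative cases.
example_days = days_to_threshold(
    daily_cases=[5, 10, 15, 20, 30],
    population=50_000,
    pandemic_start=date(2020, 3, 1),
    first_report_date=date(2020, 3, 10),
)
print(example_days)  # 12
```

In this made-up example the county reaches 50 cumulative cases on its fourth reporting day (13 March), 12 days after the assumed pandemic start date. A negative coefficient on this variable means that counties with larger values tend to have lower outcome counts.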
Surprisingly, this was not directly related to RUCC designations: neither rural nor urban RUCC codes were found to be important for cumulative or 14-day cases, with the exception of RUCC 8, a highly rural category. While in most states metropolitan areas were hit first before COVID-19 spread to rural areas, there were some exceptions, such as South Carolina, where the pattern was reversed [51]. The pattern of RUCC codes here does not match a clear rural (4-9) or urban pattern, either. Other papers analyzing cumulative county death and case counts for SDoH did not factor in when COVID-19 was first present in the county. We found this feature to be important as a protective factor for cases and deaths. However, features important for individual case-finding studies may differ from the trends observed at the county level or in other spatial units; these county-level associations should be treated as hypotheses that can inform further causal research.
Health domain factors (percent-obese and percent-who-smoke) were not significant in our models for cumulative deaths or 14-day maximum average deaths. This contrasts with other research using both county-level and individual-level data [11, 52, 53] that identified smoking and obesity as significant risk factors for COVID-related mortality. There are several possible reasons why these factors were not significant in our models. First, despite the feature selection steps described above, collinearity remains between some of our independent variables (Fig. 1 of Additional file 1). Both factors, for example, were negatively correlated with adults-with-college-degree, one of the factors found to be predictive in all our models. In post hoc analyses, the associations between these two health factors and all death-related outcomes were significant, which supports this explanation (Table 4 of Additional file 1). Second, it is possible that these clinical factors, which aggregate individual-level health characteristics, were not appropriate indicators of overall community-level status, which was captured better in the other domains included in Fig. 1.
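A residual-collinearity check of the kind described here can be sketched with pairwise Pearson correlations. The Python below is only an illustration under stated assumptions: the county values are invented, the 0.7 cutoff is arbitrary, and the function names are hypothetical. It is not the feature selection procedure used in this study.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def flag_collinear(features, cutoff=0.7):
    """Return (name_a, name_b, r) for every feature pair with |r| >= cutoff."""
    names = sorted(features)
    flagged = []
    for i, name_a in enumerate(names):
        for name_b in names[i + 1:]:
            r = pearson(features[name_a], features[name_b])
            if abs(r) >= cutoff:
                flagged.append((name_a, name_b, round(r, 3)))
    return flagged


# Invented values for six hypothetical counties (percentages).
demo_features = {
    "adults_with_college_degree": [15, 20, 25, 30, 35, 40],
    "percent_obese": [38, 36, 33, 31, 28, 25],
    "percent_who_smoke": [24, 22, 19, 18, 15, 13],
}

for name_a, name_b, r in flag_collinear(demo_features):
    print(f"{name_a} vs {name_b}: r = {r}")
```

When two predictors are this strongly correlated, a multivariable model may attribute their shared signal to only one of them, which is one way a factor can be significant on its own yet non-significant in the full model.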
The model diagnostics show significant variation in model quality. The models with the best diagnostics were those that captured outcomes only for the early days of the pandemic (30, 60, and 90 days). For longer time frames, the models are less accurate, suggesting that factors not captured in our models affect the case and death counts. Additionally, other research teams attempting to account for the time since COVID-19 was present in the county create a variable indicating the days since the first case. As our findings demonstrate, results may be less reliable when the time period used to assess the relationship between SDoH and COVID-19 outcomes is long.
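To make the outcome definitions concrete, the sketch below computes, for a single county, the cumulative count and the maximum 14-day rolling average within a given horizon (for example 30, 60, or 90 days after the county's start date). It is a minimal illustration with invented data and a hypothetical function name, and it assumes the daily series begins on the county-level start date; it is not the code used in this study.

```python
def window_outcomes(daily_counts, horizon_days, rolling_window=14):
    """Return (cumulative count, maximum rolling average) over the first
    horizon_days of a daily series. The rolling average is None when the
    horizon is shorter than the rolling window."""
    window = daily_counts[:horizon_days]
    cumulative = sum(window)
    if len(window) < rolling_window:
        return cumulative, None
    # Slide a fixed-width window across the series, tracking the largest sum.
    running = sum(window[:rolling_window])
    best = running
    for i in range(rolling_window, len(window)):
        running += window[i] - window[i - rolling_window]
        best = max(best, running)
    return cumulative, best / rolling_window


# Invented 90-day series: 14 days at 1/day, 14 days at 2/day, then none.
demo_series = [1] * 14 + [2] * 14 + [0] * 62

for horizon in (30, 60, 90):
    print(horizon, window_outcomes(demo_series, horizon))
```

Because each longer horizon contains the shorter ones, the cumulative count can only grow and the maximum rolling average can only stay the same or increase as the horizon lengthens.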
Other research teams that analyzed county-level COVID-19 outcomes and SDoH for the entire U.S. did not incorporate shelter-in-place policies into their analyses, as many counties did not have a policy until after COVID-19 was present. We found this to be one of the most important variables. It is negatively associated with all outcome variables in most models, indicating that, even when controlling for a wide variety of SDoH factors, having a policy in place before COVID-19 is present in a county may have a significant protective effect.
Sensitivity analysis further demonstrated that our findings were sensitive to our choice of county-level start date. This may be due to the slow growth rate of COVID-19 in some counties. If many counties do not reach 10/10,000 until months after the first case, policy or behavior changes in the interim could affect the results. Regardless, the results of our sensitivity analysis overlapped with our main findings, and the overlapping factors may be those significantly associated with our outcomes. The COVID-19-related, vulnerable-populations, and population-characteristics categories were important in both the main analysis and the sensitivity analysis, suggesting they may be among the most important factors driving county-level variation in outcomes.
Post-hoc analyses of the 3% of counties excluded due to missing data showed that the strongest predictor of missingness was RUCC code. Relative to the overall distribution of RUCC codes, the highest rate of deletion occurred in the RUCC 9 category (82%). Specifically, 15% of RUCC 9 counties (completely rural or less than 2500 urban population, not adjacent to a metro area) were excluded due to missing data. Currently, however, the observed trends for included RUCC 9 counties align with the RUCC 7 and 8 counties, so we would not expect those missing counties to differ substantially.
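When summarizing exclusions by RUCC code, two different quantities are easy to conflate: the exclusion rate within a category and that category's share of all excluded counties. The sketch below separates the two. The counts are invented for illustration and do not reproduce the figures reported above, and the function name is hypothetical.

```python
def missingness_by_category(total_by_category, excluded_by_category):
    """For each category, return the exclusion rate within the category and
    the category's share of all exclusions."""
    total_excluded = sum(excluded_by_category.values())
    summary = {}
    for category, total in total_by_category.items():
        excluded = excluded_by_category.get(category, 0)
        summary[category] = {
            "rate_within_category": excluded / total,
            "share_of_all_excluded": (
                excluded / total_excluded if total_excluded else 0.0
            ),
        }
    return summary


# Invented county counts for three rural RUCC categories.
demo_totals = {7: 400, 8: 200, 9: 400}
demo_excluded = {7: 5, 8: 5, 9: 40}

for category, stats in missingness_by_category(demo_totals, demo_excluded).items():
    print(category, stats)
```

In this invented example a category can account for most of the excluded counties while still losing only a modest fraction of its own members, so the two percentages answer different questions about potential selection bias.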
Limitations and future directions
Our research has several limitations. First, we were unable to control for county-level COVID-19 testing rates, as the data were unavailable. Therefore, our case counts may not be reliable, as testing policies and resources differed across counties. This could be causing the underdispersion seen in our model diagnostics, particularly for the models with long timeframes. Similarly, there may be limitations in the quality of COVID-19 death data, which are derived largely from death certificates in the United States. While there are substantial quality control and certification efforts underway, the burden of the ongoing pandemic limits available resources for this task (citation). The excess death data available through the CDC provide an additional source of information on COVID deaths that may serve as a valuable alternative or complement to death certificate data, and should be explored in future analyses [54]. Not all SDoH features are updated at the same frequency, and some features may not be as up to date as others. Our features did not always have consistent results across similar models; for instance, the significant RUCC codes did not divide cleanly into rural and urban but were a mix of both. However, given the comprehensive nature of our feature set, our multiple analyses with different end points, and our extensive diagnostic testing, we believe our findings are robust despite these limitations.
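For readers unfamiliar with dispersion diagnostics, one common summary for count models is the Pearson dispersion statistic. The sketch below assumes a Poisson-type model in which the variance equals the fitted mean; that assumption, the function name, and the data are illustrative only and do not describe the model family or diagnostics used in this study.

```python
def pearson_dispersion(observed, fitted, n_params):
    """Pearson chi-square divided by residual degrees of freedom, assuming
    Poisson variance (variance equal to the fitted mean). Values well below 1
    suggest underdispersion; values well above 1 suggest overdispersion."""
    chi_square = sum((y - mu) ** 2 / mu for y, mu in zip(observed, fitted))
    return chi_square / (len(observed) - n_params)


# Invented counts that vary less around the fitted mean than Poisson predicts.
demo_observed = [9, 11, 10, 10, 9, 11]
demo_fitted = [10.0] * 6

print(pearson_dispersion(demo_observed, demo_fitted, n_params=1))
```

Measurement artifacts such as uneven testing can distort the apparent variance of case counts, which is one reason a dispersion diagnostic may drift from 1 over long timeframes.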