Aspect-based classification of vaccine misinformation: a spatiotemporal analysis using Twitter chatter

Abstract

Background

The spread of misinformation of all types threatens people’s safety and disrupts decision-making. COVID-19 vaccination has been widely discussed on social media platforms, accompanied by a large volume of misleading and fallacious information. This false information has a critical impact on the safety of society as it prevents many people from taking the vaccine, slowing the world’s return to normal. Therefore, it is vital to analyze the content shared on social media platforms, detect misinformation, identify aspects of misinformation, and efficiently represent related statistics to combat the spread of misleading information about the vaccine. This paper aims to support stakeholders in decision-making by providing solid and current insights into the spatiotemporal progression of the common misinformation aspects of the various available vaccines.

Methods

Approximately 3800 tweets were annotated into four expert-verified aspects of vaccine misinformation obtained from reliable medical resources. Next, an Aspect-based Misinformation Analysis Framework was designed using the Light Gradient Boosting Machine (LightGBM) model, a fast and efficient gradient-boosting machine learning model. Based on this dataset, spatiotemporal statistical analysis was performed to infer insights into the progression of aspects of vaccine misinformation among the public. Finally, the Pearson correlation coefficient and p-values were calculated for the global misinformation count against the vaccination counts of 43 countries from December 2020 until July 2021.

Results

The optimized per-class (i.e., per-aspect) classification accuracy was 87.4%, 92.7%, 80.1%, and 82.5% for the “Vaccine Constituent,” “Adverse Effects,” “Agenda,” and “Efficacy and Clinical Trials” aspects, respectively. The model achieved an Area Under the ROC Curve (AUC) of 90.3% and 89.6% for validation and testing, respectively, which indicates the reliability of the proposed framework in detecting aspects of vaccine misinformation on Twitter. The correlation analysis shows that 37% of the countries addressed in this study were negatively affected by the spread of misinformation on Twitter, resulting in a reduced number of administered vaccines during the same timeframe.

Conclusions

Twitter is a rich source of insight on the progression of vaccine misinformation among the public. Machine Learning models, such as LightGBM, are efficient for multi-class classification and proved reliable in classifying vaccine misinformation aspects even with limited samples in social media datasets.

Background

The last two years have been challenging for humanity as we all fight a global pandemic. With the fast spread of the novel coronavirus and the emergence of new variants world-wide, it was paramount for scientists to discover vaccines to prevent and reduce the impact of the virus. In a quest to develop a suitable vaccine, many countries participated in phase 3 clinical trials around November 2020 [1]. Countries like the United States of America (USA), Canada, and the United Kingdom (UK) have registered 93, 22, and 25 clinical trials, respectively [1]. Amongst all the efforts, the AstraZeneca vaccine, developed with the University of Oxford, provided a breakthrough with a success rate of 70.4%, followed by the Pfizer and BioNTech vaccine, which was able to reduce the severity of 95% of the cases [2]. Statistics show a decline of approximately 84% in COVID-19 reported deaths during the vaccination process in the USA from December 2020 to July 2021 [3]. Similarly, the reported deaths declined by 96% in the UK throughout the same period [3]. As these vaccines performed well, countries like the USA and the UK began mandating the vaccines for public use; however, support for taking a newly discovered and quickly tested vaccine fell short of a majority of the population. Simultaneously, new variants of the virus started spreading. Studies show that, of the total Delta variant hospitalizations in the US, only 26% were among those who had taken the primary shots and 1% among those who had also taken the booster dose [4]. Similarly, of the total Omicron variant hospitalizations in the US, 33% and 12% were among those who had taken the primary shots and those who had also taken the booster dose, respectively [4]. However, the effectiveness of the vaccines was still a major concern.

Consequently, throughout and after the development of the various COVID-19 vaccines, the public continuously expressed their concerns, beliefs, and experiences regarding the vaccine jabs on social media platforms. These include the risk of severe events such as clots [5], the effectiveness, especially for children and the elderly, as well as the long-term effects of various types of vaccines [6]. Twitter was among the most commonly used platforms for such public discussions, where people freely express and reflect their opinions and viewpoints. With the increasing number of vaccine-related tweets, an increase in shared misinformation was continuously reported. As a result, Twitter removed over 8400 tweets to control the spread of vaccine misinformation content amongst its users [7], while the World Health Organization (WHO) released a toolkit to tackle vaccine misconceptions and introduced a hashtag (#VaccinesWork) to bust myths and delusions regarding the vaccine [8]. Although health organizations have made efforts to reduce the circulation of such messages, these efforts have not been adequate to eradicate people’s negative notions towards the vaccine.

Various research efforts attempted to analyze the public opinion and attitude towards the available vaccines. Most of the research work reviewed utilized sentiment analysis to provide insight into the growing anti-vaccine movements on Twitter [9]. For instance, Nezhad et al. [10] utilized sentiment analysis to classify the polarity of Persian COVID-19 vaccine tweets into positive, negative, and neutral. The proposed framework aims to understand the Iranian views and opinions towards local and imported vaccines. Although this framework proved efficient, it was limited in geographical coverage as well as in time; the data was collected over a span of 5 months starting from April 2020. Furthermore, Marcec et al. [11] deployed sentiment analysis to provide insight into the public acceptance of three COVID-19 vaccines, namely, AstraZeneca/Oxford, Pfizer/BioNTech, and Moderna. This study focused on providing temporal and causal analytics over a span of 4 months starting from December 2020, and was therefore limited in time as well as in vaccine types. The majority of these research studies were limited either in the timeframe [10, 11], geographical coverage [10, 12, 13], or to certain vaccines [11]. It is critical to take the timeframe, regional scope, and vaccine types into account to fully comprehend the development of vaccine misinformation. The timeframe should be in line with the vaccine launching and administration timeframe. The more closely the timeframe of the dataset matches that of the vaccine administration, the more insightful the analysis will be, since this helps in analyzing timely events and their relevance to the progression of misinformation. Additionally, wider geographical coverage produces more insightful results because different types of COVID-19 vaccines were authorized for use in different countries. Furthermore, since COVID-19 social media hashtags are accessible globally, misinformation that originates in one country can impact other countries. Finally, to fully analyze the progression of misinformation, it is crucial to look at the various vaccine types. We speculate that some vaccines are more frequently and closely related to specific elements of misinformation. Therefore, a broader study can result in more thorough insights. In an attempt to overcome some of these limitations, Yousefinaghani et al. [14] deployed lexicon-based sentiment analysis to gain insight into the public attitude towards the COVID-19 vaccine on a large scale and over an extended timeframe. The proposed framework tracks the frequent hashtags and discussion topics as well as the progression of themes. Furthermore, the proposed framework compares the opinions from several locations and categorizes the opinions in the tweets into pro-vaccine, anti-vaccine, and hesitant. However, the study does not associate the revealed opinion with specific aspects of misinformation or sources of concern.

Even though some timeframe and geographical coverage limitations have been addressed, the existing frameworks do not provide full in-depth insight into public concerns [15]. In addition, despite the urgency and necessity of addressing the spread of misinformation, very few research works targeted misinformation related to COVID-19 vaccines on social media. Wonodi et al. [16] conducted interviews with representative health and management figures to provide some understanding of the sources of vaccine hesitation, misinformation, and conspiracy theories in Nigeria. However, in these times and under such circumstances, the necessity of automating the analysis process arises. Hayawi et al. [17] proposed the use of multiple Machine Learning (ML) models, including XGBoost, LSTM, and a pre-trained BERT model, to detect COVID-19 vaccine misinformation. Although this framework proved efficient in detecting misinformation and provides some insight, it still lacks in-depth insight into the sources of the spread of misinformation. Sentiment and opinion analysis provide an abstract and broad understanding of the problem and public concerns, which might be enough for tackling some issues; however, this is not the case for vaccine misinformation. In order to provide full insight and perspective to stakeholders, a better understanding of the sources of the concerns and the aspects of the misinformation is needed. To this end, this paper proposes a novel framework for detecting and categorizing aspects of COVID-19 vaccine misinformation on Twitter. The proposed framework provides a comprehensive and solid understanding of the aspects of the vaccine misinformation among the public. Moreover, it provides a solid understanding of the spatiotemporal progression of the common misinformation aspects related to the various available vaccines. This is achieved by providing analytics factored by the vaccine type, time, and location. This research spans a period of about 7 months, including the introduction and widespread administration of the majority of the common vaccines. Additionally, this research takes into account the analysis of data from Twitter chatter centered on the 10 most popular vaccines.

This work aims to support stakeholders around the globe to combat COVID-19 vaccine misinformation through various representative and comprehensive analytical results. To the authors’ best knowledge, this is the first study analyzing and modeling categories of misinformation through aspect analysis on COVID-19 vaccine Twitter data. In addition, this study publishes the first annotated dataset for aspects of vaccine misinformation. As summarized in Table 1, the reported studies related to COVID-19 vaccine analyze either sentiment or misinformation only, but not the sources and aspects of misinformation. Furthermore, only a few studies published the related datasets.

Table 1 Review of existing research works related to COVID-19 vaccine

Methods

This section describes the methodology of the proposed vaccine misinformation aspect analysis framework. As illustrated in Fig. 1, the pipeline starts with data collection and annotation, followed by cleaning and preprocessing, and ends with the prediction and the analytics, including the training, validation, and evaluation of the Machine Learning models used.

Fig. 1 CovVax misinformation aspects analytics framework

Dataset

To annotate a training dataset for COVID-19 vaccine misinformation aspects, we utilized the ANTi-Vax English dataset containing COVID-19 vaccine-related tweet IDs collected and organized between December 20, 2020 and July 21, 2021 [17, 18]. This dataset was collected by filtering Twitter data using a predefined set of keywords, including ‘vaccine’ as well as specific COVID-19 vaccine names such as ‘Sinopharm’, ‘Moderna’, etc. This dataset was introduced as a vaccine misinformation dataset, which serves the purpose of the proposed framework. Moreover, the timeline of data collection is sufficient to provide insight into the progression and evolution of misinformation aspects. The dataset was collected over a span of approximately 7 months, starting with the approval of the first vaccine in December 2020 and covering the approvals and distribution of the common vaccines, namely, Pfizer, AstraZeneca, Sinopharm, Sputnik, Moderna, Johnson&Johnson, Covaxin, Covishield, SARS-CoV-2, and Sinovac.

Since only the tweet IDs and labels were published, the IDs were hydrated to extract the full content of every corresponding tweet into a comma-separated values (CSV) file. The resulting CSV file includes the text of the tweets as well as related metadata, e.g., location and timestamp, which are necessary for spatiotemporal analytics. After the hydration process, approximately 12,000 tweets were parsed, 4,000 of which had been labeled by the dataset authors as misinformation.

Data annotation

The annotation task was conducted in two stages. First, the misinformation labels assigned by the authors of the published misinformation dataset were manually validated. Second, aspects of misinformation were categorized, labeled, and verified against reliable medical sources. Although the ANTi-Vax Twitter dataset [17, 18] was annotated for misinformation detection, multiple misclassifications were detected during the manual validation of the dataset. Therefore, the data was re-annotated manually for misinformation detection.

Table 2 shows the rules used for both annotation stages. These rules were obtained from reliable medical resources, including the Centers for Disease Control and Prevention (CDC) [19, 20]. The guidelines were developed based on a Toolkit from UNICEF [21], similar annotation guides [22, 23], as well as policies and guides used by Twitter [24]. The aspects’ guidelines were refined throughout the process of annotation to achieve comprehensive coverage of all discussed topics related to vaccine misinformation concerning the public on Twitter.

Table 2 Annotation rules and guidelines used to categorize aspects of misinformation

Throughout the first stage, the tweets were read manually and annotated as misinformation if they explicitly or implicitly contain any of the myths or misinformation in the rules and guidelines. For instance, Table 3 shows samples of tweets that were annotated as misinformation or otherwise.

Table 3 First annotation stage: sample tweets annotated as misinformation vs not misinformation

After the first stage of annotation, only tweets including misinformation were annotated again in the second stage to identify the misinformation aspect. During this stage, the aspects were assigned based on the content of the tweets as well as the text of the hashtags used. Both explicit and implicit aspects were considered in this stage too. The dominant aspect was considered for multi-aspect tweets. Table 4 shows samples related to the four aspects of misinformation.

Table 4 Second annotation stage: sample tweets annotated with aspects of misinformation

Figure 2 shows the distribution of the misinformation aspects. It can be seen that “Agenda” as well as “Efficacy and Clinical Trials” were the dominant aspects representing public concerns about the vaccine, making up almost 80% of the dataset.

Fig. 2 Vaccine misinformation aspects’ distribution

Data preprocessing

The annotated dataset was passed through the preprocessing and cleaning pipeline prior to training the models. First, the pipeline drops the duplicated tweets produced by retweets. Second, external URLs and links as well as punctuation were removed. Third, the negation was resolved. Resolving negation is a common step in the preprocessing pipeline in Natural Language Processing. Negation words such as “no”, “not”, and “never” can significantly alter the meaning and semantic orientation of a sentence. Therefore, in this step, the framework detects negated words and modifies the sentiment of the following part of the sentence accordingly. In addition, a predefined set of negation-related stopwords was kept, e.g., don’t, no, not, etc., since negations are of extreme importance for prediction and analytics in the context of misinformation. For instance, it is crucial to detect negations of vaccine efficacy, e.g., “This vaccine is not a true vaccine. It is a trial”. Finally, the text was converted to lowercase.
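The cleaning steps above can be sketched as follows. This is a minimal illustration rather than the pipeline used in this work: the stopword list, the negation list, and the function names `clean_tweet` and `preprocess` are assumptions made for the example, and the negation-resolution step is reduced to simply retaining negation words. Duplicates are detected after cleaning, which is a design choice of this sketch so that retweets differing only in an appended link collapse into one entry.

```python
import re
import string

# Hypothetical negation words to retain; the exact set used in the paper is not listed.
NEGATIONS = {"no", "not", "never", "don't", "cannot"}
# Hypothetical, deliberately tiny stopword list (illustration only), minus the negations.
STOPWORDS = {"the", "a", "an", "is", "it", "this", "of", "to", "and"} - NEGATIONS

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Remove ASCII punctuation but keep apostrophes so that "don't" survives intact.
PUNCT_TABLE = str.maketrans("", "", string.punctuation.replace("'", ""))


def clean_tweet(text):
    """Remove URLs and punctuation, lowercase, and drop non-negation stopwords."""
    text = URL_RE.sub(" ", text)
    text = text.translate(PUNCT_TABLE)  # "#hashtag" becomes "hashtag"
    tokens = text.lower().split()
    return " ".join(tok for tok in tokens if tok not in STOPWORDS)


def preprocess(tweets):
    """Clean every tweet and keep only the first occurrence of each cleaned text."""
    seen = set()
    cleaned = []
    for tweet in tweets:
        result = clean_tweet(tweet)
        if result and result not in seen:
            seen.add(result)
            cleaned.append(result)
    return cleaned
```

Applied to the example sentence quoted above, `clean_tweet` keeps the word "not", so the negated efficacy claim remains visible to the classifier.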

Results

Prediction of aspects of vaccine misinformation

Multiple experimental trials were conducted to achieve the highest possible performance for the aspect prediction model. Among the models tested, the optimized LightGBM model with 50 features outperformed the other experimental models. This model was trained on the stratified annotated vaccine misinformation dataset with a split of 85% for training and 15% equally split between validation and testing. Table 5 shows the dataset distribution.

Table 5 Dataset distribution
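A stratified 85% / 7.5% / 7.5% split such as the one described above could be produced along the following lines. This is a sketch under stated assumptions: the function name `stratified_split`, the fixed seed, and the rounding rule are illustrative choices, not details reported in this work.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_frac=0.85, seed=0):
    """Return (train, val, test) index lists that preserve the label proportions.

    Within each label, `train_frac` of the samples go to training and the
    remainder is divided as evenly as possible between validation and testing.
    """
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, val, test = [], [], []
    for label in sorted(by_label):
        idxs = by_label[label]
        rng.shuffle(idxs)
        n_train = round(train_frac * len(idxs))
        n_val = (len(idxs) - n_train) // 2
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]
    return train, val, test
```

Splitting per label keeps the rare aspects represented in all three subsets, which matters for a skewed dataset such as this one.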

The LightGBM aspect classification model was able to achieve a validation ensemble accuracy of 80.1% and a testing accuracy of 71.3%. Accuracy is the performance evaluation metric used throughout the evaluation of the experimental results. For each of the vaccine misinformation aspects, for instance, “Vaccine Constituent,” the classification results could be any of the following:

  • True Positive (TP): Predicted aspect is “Vaccine Constituent” and the ground truth is “Vaccine Constituent.”

  • True Negative (TN): Predicted aspect is any non-“Vaccine Constituent” aspect, and the ground truth is any non-“Vaccine Constituent” aspect.

  • False Positive (FP): Predicted aspect is “Vaccine Constituent” while the ground truth is any non-“Vaccine Constituent” aspect.

  • False Negative (FN): Predicted aspect is any non-“Vaccine Constituent” aspect, while the ground truth is “Vaccine Constituent.”

Accuracy is the percentage of correctly labeled tweets.

$$Accuracy= \frac{TP+TN}{TP+FP+FN+TN}$$
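The one-vs-rest definitions and the accuracy formula above translate directly into code. The sketch below is illustrative only; the function names `confusion_counts` and `per_class_accuracy` are assumptions introduced for the example.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count TP, TN, FP, FN for one aspect treated as the positive class."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred != positive and truth != positive:
            tn += 1
        elif pred == positive:  # predicted positive, ground truth is another aspect
            fp += 1
        else:  # predicted another aspect, ground truth is positive
            fn += 1
    return tp, tn, fp, fn


def per_class_accuracy(y_true, y_pred, positive):
    """Accuracy = (TP + TN) / (TP + FP + FN + TN) for the given aspect."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred, positive)
    return (tp + tn) / (tp + fp + fn + tn)
```

Because true negatives count as correct, the per-class accuracy of a rare aspect can be high even when few of its tweets are recovered, which is worth keeping in mind when reading per-class figures on skewed data.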

Table 6 shows the obtained per-class (i.e., per-aspect) accuracy.

Table 6 Vaccine misinformation aspect experimental results per class

Figure 3 shows the Receiver Operating Characteristic (ROC) curves for validation and testing. The obtained Area Under the ROC Curve (AUC) is 90.3% and 89.6% for validation and testing, respectively. These results indicate acceptable performance, considering the skewness of the data.

Fig. 3 Receiver Operating Characteristic (ROC) Curve
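For reference, the AUC of a single aspect against all others can be computed from predicted scores with the rank-based formulation below. This is a sketch, not the evaluation code of this work: how the per-aspect curves were combined into one multi-class AUC is not specified here, so the example covers the binary one-vs-rest case only, and the function name `roc_auc` is an assumption.

```python
def roc_auc(scores, is_positive):
    """Binary AUC: probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, flag in zip(scores, is_positive) if flag]
    neg = [s for s, flag in zip(scores, is_positive) if not flag]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a few hundred validation or test tweets.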

Spatiotemporal analytics

In order to provide in-depth insight into the progression of vaccine misinformation aspects, multiple analytics were produced. For instance, Fig. 4 illustrates the number of misinformation tweets associated with each type of vaccine. It can be seen that Pfizer, Moderna, and AstraZeneca accounted for the majority of the public misinformation. Pfizer, specifically, was by far the most discussed in the misinformation tweets relevant to all aspects, making up approximately 53% of the total tweets in the dataset.

Fig. 4 Frequency of misinformation tweets per vaccine type

Furthermore, Fig. 5 illustrates the number of misinformation tweets factored by aspect for each vaccine type. It can be seen that Pfizer (represented by the green bar) generated most of the vaccine misinformation discourse relevant to all the aspects. Moreover, Moderna and AstraZeneca (represented by the yellow and light blue bars, respectively) were the second most discussed vaccines in the misinformation tweets relevant to all the defined misinformation aspects. However, it is observed that Moderna was associated with the aspects of “Vaccine Constituent” and “Efficacy and Clinical Trials” more frequently than AstraZeneca, since it is an mRNA-based vaccine whose efficacy was questioned among the public. Meanwhile, AstraZeneca was associated more frequently than Moderna with the aspect of “Adverse Effects,” given that many cases reported blood clots and heart problems after taking the vaccine. Furthermore, it can be seen that the “Agenda” aspect associated with Pfizer was almost three times as frequent as for AstraZeneca and exceeded the total frequency of tweets reporting the same aspect of misinformation relevant to all other vaccine types. The fact that Pfizer is an mRNA-based vaccine triggered much discourse among the public, which was reported in their tweets.

Fig. 5 Frequency of misinformation tweets per vaccine and per aspect
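Counts such as those plotted in Figs. 4 and 5 can be obtained by matching vaccine names in the tweet text and tallying them per aspect. The sketch below assumes simple case-insensitive substring matching and a hypothetical input of `(text, aspect)` pairs; the keyword list is a subset of the vaccine names listed in the Dataset section, and the actual matching rules used in this work may differ.

```python
from collections import Counter

# Lowercase keywords for substring matching (an assumption of this sketch).
VACCINE_KEYWORDS = [
    "pfizer", "astrazeneca", "sinopharm", "sputnik", "moderna",
    "johnson", "covaxin", "covishield", "sinovac",
]


def vaccine_aspect_counts(records):
    """Tally tweets per vaccine and per (vaccine, aspect) pair.

    `records` is an iterable of (tweet_text, aspect) pairs. A tweet naming
    several vaccines is counted once for each vaccine it mentions.
    """
    per_vaccine = Counter()
    per_pair = Counter()
    for text, aspect in records:
        lowered = text.lower()
        for name in VACCINE_KEYWORDS:
            if name in lowered:
                per_vaccine[name] += 1
                per_pair[(name, aspect)] += 1
    return per_vaccine, per_pair
```

Dividing each `per_pair` count by the matching `per_vaccine` total gives per-vaccine aspect percentages.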

Figure 6 shows the percentage of misinformation aspects associated with the three most discussed vaccine types on Twitter: Pfizer, Moderna, and AstraZeneca.

Fig. 6 Misinformation aspects associated with the three most common vaccines

To understand the spatiotemporal progression of the vaccine misinformation, the geo-location and timestamp metadata of the tweets were used to develop spatial and temporal analytics. To illustrate the geographical spread of the most dominant vaccine misinformation aspects, Fig. 7 shows the spatial distribution obtained from the Twitter data. Figure 7 provides a visual representation of the prevalence of the four vaccine misinformation aspects considered in this research. Blue refers to ‘Efficacy and Clinical Trials’, green refers to ‘Agenda’, pink refers to ‘Adverse Effects’, and orange refers to ‘Vaccine Constituent’. The map shows spatial dominance at two levels: coverage and intensity. Regions are colored with a specific color to indicate the dominance of a certain aspect of misinformation in that region. The intensity of the color indicates the frequency or the prevalence of particular aspects of misinformation in specific regions. Furthermore, overlapping colors indicate that several aspects of misinformation are prevalent in that particular region. Finally, circles indicate more specific areas, such as cities, that were mentioned in the Twitter chatter data.

Fig. 7 Aspects of vaccine misinformation spatial analysis

By mapping the spatial distribution of the aspects of misinformation, policymakers can identify regions that require specific interventions aimed at correcting misguided beliefs with evidence-based messaging.

As illustrated in Fig. 7, the vaccine “Efficacy and Clinical Trials” was the most dominant aspect in European countries as well as Middle Eastern and African countries. With 33 total registered trials in Russia, 18 in South Africa, and fewer than 10 trials in the rest of Africa [1], Twitter users expressed their growing concern about the insufficient testing of the experimental jabs. It was also observed that the “Efficacy and Clinical Trials” as well as “Agenda” misinformation aspects were dominant in South Africa, where some experimental trials were conducted before approving the vaccine. Meanwhile, in the USA and Canada, Twitter users were more concerned about the vaccine being part of an agenda, whether depopulation or market profit. Throughout the study duration, most of the tweets concerned about the adverse effects were in Australia, Colombia, and Japan. Several adverse events were reported worldwide and in Australia [25], including testing positive for HIV, as many claimed.

Figure 8 shows the temporal progression of misinformation aspects over the dataset timeline, starting from the approval of the first vaccine in December 2020 until July 2021.

Fig. 8 Aspects of vaccine misinformation temporal analysis
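A temporal series of this kind can be built by bucketing each annotated tweet by month and aspect. The following sketch assumes ISO 8601 timestamps (e.g., `2021-03-15T10:00:00`) and a hypothetical function name `monthly_aspect_counts`; timestamps in the raw Twitter metadata use a different format and would need to be converted first.

```python
from collections import Counter
from datetime import datetime


def monthly_aspect_counts(records):
    """Count tweets per (month, aspect), with the month formatted as YYYY-MM.

    `records` is an iterable of (iso_timestamp, aspect) pairs.
    """
    counts = Counter()
    for timestamp, aspect in records:
        month = datetime.fromisoformat(timestamp).strftime("%Y-%m")
        counts[(month, aspect)] += 1
    return counts
```

Plotting the counts of one aspect over the sorted months gives one curve of the temporal analysis; a finer bucket, such as the ISO week, would follow the same pattern.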

As shown in Fig. 8, “Efficacy and Clinical Trials” was the dominant aspect during the first three months, starting from the first official approval of the vaccine. After the official approval, and as published by the American Journal of Managed Care (AJMC) [25], many countries, including the UK, began distributing the vaccines. In addition, vaccination acceleration plans were adopted in the USA. In early February and March, multiple efficacy-related reports were published. Simultaneously, multiple variants started to spread, raising concern about the efficacy of the vaccines.

An increasing public controversy arose in March, associating blood clots with AstraZeneca after several reported cases. In response to those cases, the vaccine was temporarily suspended in South Africa and many European countries to further investigate its association with blood clots. Moreover, AstraZeneca issued a statement rejecting a causal link between its vaccine and blood clots, and experts stressed that there is no causal link [25]. This can be seen in the temporal progression of the curve associated with the aspect “Efficacy and Clinical Trials”, represented in green in Fig. 8. The curve grows between February and March, then gradually declines toward April upon the release of experimental statistics negating the relationship between the vaccine and the reported cases. Toward the end of March, Pfizer and Moderna released positive efficacy data and survey results reporting a drop in vaccine hesitancy [26]; hence the decline of the “Efficacy and Clinical Trials” aspect curve.

Furthermore, the “Agenda” aspect, represented by the yellow color in Fig. 8, starts to peak in April. This can be explained by the latest events starting from late April when fake Pfizer vaccines were reported in Mexico and Poland [27] as well as the distribution of vaccines, especially AstraZeneca, outside the USA.

The “Efficacy and Clinical Trials” aspect peaks again during May 2021. May witnessed the preparation for authorization and approval of the Pfizer vaccine in adolescents [25]. Moreover, multiple vaccines, including doses of Johnson & Johnson, Pfizer, Moderna, and AstraZeneca (some of which were not yet approved in the USA), were shipped out from the USA [25]. In addition, several cases of adverse effects, including heart problems, were reported towards the end of May. This is reflected in the analysis, as the “Efficacy and Clinical Trials” as well as “Adverse Effects” aspects peak. Furthermore, it is observed that “Agenda” remains the most dominant aspect and remarkably peaks in May, exceeding the peaks of all aspects throughout the entire duration. The peaks of these three aspects were aligned due to the relatedness of the topics involved in discussing them. This indicates that the spread of misinformation leads to the public being concerned about having intentionally rushed and untested vaccines that have adverse and dangerous effects that fit into depopulation and profit agendas. Finally, in relation to the timeline, in June 2021, employers in the USA were permitted by the Equal Employment Opportunity Commission to mandate vaccination among their employees [25].

To evaluate the correlation between the misinformation tweet count and the number of vaccinations administered world-wide during the same timeframe, the standard Pearson correlation coefficient was calculated. This work hypothesizes that these two variables are negatively correlated at a significance level of 0.05. The Pearson correlation coefficient and p-values were calculated for the global misinformation count against the vaccination counts of 43 countries from December 2020 until July 2021. The correlation values varied, but all values indicated negative correlations, ranging from -0.349 to -0.915. A larger absolute value indicates a stronger negative correlation between the study variables, that is, vaccination counts tend to decrease as misinformation tweet counts increase. To assess the significance of the correlation, a p-value threshold of 0.05 is used. P-values equal to or less than 0.05 are considered significant, indicating a statistically significant correlation. On the other hand, p-values greater than 0.05 are considered insignificant. Figure 9 plots the correlation of the misinformation count against the vaccination count for 43 countries, and divides them into significant and insignificant correlations. The gradient shades of the bars are proportional to the p-values, where darker shades indicate lower p-values and hence more significant correlations. The graph shows that the negative correlation was significant for 37% of the countries, which suggests that these countries were negatively affected by the spread of misinformation. Since misinformation tweets spread on social media are globally accessible, we deem it imperative to consider their significance beyond local contexts.

Fig. 9 Pearson correlation coefficient of misinformation tweet counts against vaccination counts between December 2020 and July 2021
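The correlation test for one country can be sketched as follows. The example assumes two equally long series (for instance, one value per month) and computes the two-sided p-value from the Student t distribution with n - 2 degrees of freedom by numerical integration; the temporal granularity, the function names, and the integration scheme are assumptions of this sketch, and a statistics library routine such as `scipy.stats.pearsonr` would normally be used instead.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_pdf(t, df):
    """Probability density of the Student t distribution."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)


def two_sided_p_value(r, n, steps=2000):
    """Two-sided p-value for H0: no correlation, via t = r * sqrt((n-2)/(1-r^2))."""
    if n < 3:
        raise ValueError("need at least three observations")
    if abs(r) >= 1.0:
        return 0.0
    df = n - 2
    t = abs(r) * math.sqrt(df / (1 - r * r))
    # Simpson's rule for the area under the density between 0 and |t| (steps is even).
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return min(1.0, max(0.0, 1 - 2 * area))
```

A country would fall into the significant group of Fig. 9 when `two_sided_p_value(pearson_r(x, y), len(x))` is at most 0.05.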

Discussion

Main findings

Machine Learning classification models prove reliable in detecting aspects of misinformation from the public chatter on Twitter with high accuracy results, as summarized in Table 6. The spatiotemporal analysis further validates the reliability of the proposed misinformation aspect classification framework. Spatiotemporal analytics are aligned with real-life temporal vaccine-related events. The progression of aspects of the vaccine misinformation on Twitter was directly associated with the reported vaccine-related news in different countries. Certain vaccines were more associated with specific aspects of vaccine misinformation among the public. For instance, throughout the span of this study, Pfizer was associated with the majority of misinformation tweets. These findings prove that decision-makers and policy officials can benefit from the analysis of the progression of vaccine misinformation on social media to promptly understand public concerns and design intervention plans to combat the spread of such misinformation.

Interpretation

The findings of this study show a clear association and correlation between the progression of vaccine-related misinformation and public vaccine hesitancy. According to global and regional vaccination data, several countries show slower vaccination rates during periods of increased spread of misinformation. On the other hand, the trend lines are steeper, indicating higher vaccination rates, during periods of decreased spread of misinformation [28]. The public’s opinions shift and fluctuate in response to life events. Social media is a public venue for the exchange of public opinions. The majority of people do not seek their information from reliable medical sources and instead rely on social media as a source of information that influences their actions. Consequently, the spread of misinformation around critical medical decisions can drastically impact people’s safety and nations’ public health. The results of this work support the hypothesis that there is a significant negative correlation between misinformation spread and vaccination rates for 37% of the countries in the study. Hence, providing instant and accurate insight into the aspects of misinformation on social media can significantly support policymakers in understanding public concerns and thus in taking actions to combat misinformation and raise public awareness.

Limitations

While this study was comprehensive in terms of spatiotemporal analysis and validation with real-life events, it can be extended to cover a larger timeframe. Moreover, although this study considers vaccine-related tweets worldwide, it focused only on English tweets, which were the most prominent on Twitter. Hence, this study can be further extended to analyze multilingual datasets; tweets in native languages are valuable and would provide fine-grained insights. Furthermore, given the uneven popularity of Twitter across different regions and populations, the data collected and the corresponding interpretation may be more representative of regions where Twitter is the most commonly used social media platform, and may not fully reflect the progression of vaccine misinformation in regions where it is not. Thus, future work may consider supplementing Twitter chatter data with data from other social media platforms that are more popular in specific regions.

Conclusion

COVID-19 vaccine hesitancy is a primary worldwide concern since it significantly affects public health, and misinformation contributes significantly to this hesitancy among the public. Social media platforms, especially Twitter, have witnessed an increasing volume of shared misinformation because people tend to express their opinions and share their thoughts there freely and informally. Although several global associations and organizations attempted to fight the spread of vaccine misinformation, the efforts were limited. Moreover, the reviewed literature was limited to sentiment analysis of COVID-19 vaccine tweets, with only a few studies focusing on misinformation in general rather than on its specific aspects. To this end, this paper is the first to propose a Misinformation Aspect Analysis framework that detects and classifies COVID-19 vaccine misinformation into medically verified aspects. The manually annotated dataset of vaccine misinformation used in this paper is publicly available on GitHub. Moreover, this framework produces a variety of spatiotemporal analytics that aim to support stakeholders in assessing the situation and planning interventions accordingly. These analytics provide timely and in-depth insights into the spatial and temporal progression of vaccine misinformation and sources of concern.

This framework deployed a LightGBM model for classifying misinformation aspects and achieved per-class accuracies of 87.4%, 92.7%, 80.1%, and 82.5% for the “Vaccine Constituent,” “Adverse Effects,” “Agenda,” and “Efficacy and Clinical Trials” aspects, respectively. Furthermore, the model achieved an Area Under the ROC Curve (AUC) of 90.3% and 89.6% for validation and testing, respectively. In addition, the findings and insights derived from the spatiotemporal analytics are consistent with the timeline of events related to the COVID-19 vaccine worldwide, which supports the reliability of the proposed model.
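To make the reported metrics concrete, the sketch below shows one common way to compute a per-class accuracy and an AUC from model outputs. It does not reproduce the LightGBM model or the paper’s evaluation pipeline. The one-vs-rest definition of per-class accuracy as (TP + TN) / total is an assumption about how the metric is defined, and the function names, short aspect labels, and scores are hypothetical.

```python
def per_class_accuracy(y_true, y_pred, label):
    """One-vs-rest accuracy for `label`: (TP + TN) / number of samples."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    tn = sum(t != label and p != label for t, p in zip(y_true, y_pred))
    return (tp + tn) / len(y_true)


def binary_auc(is_positive, scores):
    """AUC as the probability that a positive outranks a negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, flag in zip(scores, is_positive) if flag]
    neg = [s for s, flag in zip(scores, is_positive) if not flag]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predictions for four tweets (made-up data).
y_true = ["adverse", "agenda", "adverse", "efficacy"]
y_pred = ["adverse", "agenda", "agenda", "efficacy"]
accuracy_adverse = per_class_accuracy(y_true, y_pred, "adverse")

# Hypothetical scores for the "adverse" class against all other classes.
is_adverse = [True, False, True, False]
adverse_scores = [0.9, 0.4, 0.35, 0.2]
auc_adverse = binary_auc(is_adverse, adverse_scores)
```

The pairwise AUC above is quadratic in the number of samples, which is acceptable for a small annotated set; a rank-based computation would be preferable for larger data.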

Availability of data and materials

The annotated COVID-19 vaccine misinformation aspects dataset is available for researchers. Only the tweet IDs and labels of misinformation aspects are available, complying with Twitter’s data privacy and terms of service.

The dataset is accessible on GitHub through this link: https://github.com/VaccineABSA/CoVax-Aspects/blob/main/README.md.

Abbreviations

ML: Machine Learning

BERT: Bidirectional Encoder Representations from Transformers

LSTM: Long Short-Term Memory

LightGBM: Light Gradient Boosting Machine

TP: True Positive

TN: True Negative

FP: False Positive

FN: False Negative

ROC: Receiver Operating Characteristic Curve

AUC: Area Under the ROC Curve

CSV: Comma-Separated Values

References

  1. Vaccination Rates, Approvals & Trials by Country – COVID19 Vaccine Tracker [Internet]. Available from: https://covid19.trackvaccines.org/trials-vaccines-by-country/. Cited 28 Mar 2022.

  2. Marsh S. The history of Covid vaccine development | Coronavirus | The Guardian [Internet]. 2021. https://www.theguardian.com/world/2021/dec/08/the-history-of-covid-vaccine-development.

  3. COVID-19 vaccinations vs. COVID-19 deaths, Dec 1, 2020 to Jul 1, 2021 [Internet]. Available from: https://ourworldindata.org/grapher/covid-vaccinations-vs-covid-death-rate?tab=table&time=2020-12-01..2021-07-01&country=FRA~DEU~ITA~GBR~USA~CAN~JPN~BRA~AUS~ZAF. Cited 28 Mar 2022.

  4. Taylor CA, Whitaker M, Anglin O, Milucky J, Patel K, Pham H, et al. Morbidity and Mortality Weekly Report COVID-19-Associated Hospitalizations Among Adults During SARS-CoV-2 Delta and Omicron Variant Predominance, by Race/Ethnicity and Vaccination Status-COVID-NET, 14 States. 2022; Available from: https://www.medrxiv.org/content/10.1101/2021.08.27.21262356v1. Cited 20 May 2022.

  5. COVID Vax Opponents and Rigid Proponents...Are Both Anti-Science? | MedPage Today [Internet]. Available from: https://www.medpagetoday.com/infectiousdisease/covid19vaccine/92413. Cited 28 Mar 2022.

  6. Ledford H, Cyranoski D, Van Noorden R. The UK has approved a COVID vaccine - here’s what scientists now want to know. Nature. 2020;588:205–6.

  7. Twitter. Updates to our work on COVID-19 vaccine misinformation [Internet]. 2021. Available from: https://blog.twitter.com/en_us/topics/company/2021/updates-to-our-work-on-covid-19-vaccine-misinformation.

  8. World Health Organization WHO. A Social Media Toolkit for Healthcare Practitioners - desktop [Internet]. 2021. https://www.who.int/publications/m/item/a-social-media-toolkit-for-healthcare-practitioners---desktop.

  9. Alam KN, Khan MS, Dhruba AR, Khan MM, Al-Amri JF, Masud M, et al. Deep Learning-Based Sentiment Analysis of COVID-19 Vaccination Responses from Twitter Data. Comput Math Methods Med. 2021;2021:4321131.

  10. Bokaee Nezhad Z, Deihimi MA. Twitter sentiment analysis from Iran about COVID 19 vaccine. Diabetes Metab Syndr Clin Res Rev. 2022;16(1):102367.

  11. Marcec R, Likic R. Using Twitter for sentiment analysis towards AstraZeneca/Oxford, Pfizer/BioNTech and Moderna COVID-19 vaccines. Postgrad Med J [Internet]. 2021;0:1–7 Available from: https://pmj.bmj.com/content/early/2021/08/08/postgradmedj-2021-140685. Cited 28 Mar 2022.

  12. Bustos VP, Comer CD, Manstein SM, Laikhter E, Shiah E, Xun H, et al. Twitter Voices: Twitter Users’ Sentiments and Emotions About COVID-19 Vaccination within the United States. Eur J Environ Public Heal. 2022;6(1):em0096.

  13. Alabrah A, Alawadh HM, Okon OD, Meraj T, Rauf HT. Gulf Countries’ Citizens’ Acceptance of COVID-19 Vaccines—A Machine Learning Approach. Mathematics [Internet]. 2022;10(3):467 Available from: https://www.mdpi.com/2227-7390/10/3/467/htm. Cited 28 Mar 2022.

  14. Yousefinaghani S, Dara R, Mubareka S, Papadopoulos A, Sharif S. An analysis of COVID-19 vaccine sentiments and opinions on Twitter. Int J Infect Dis. 2021;1(108):256–62.

  15. Do Nascimento IJB, Pizarro AB, Almeida JM, Azzopardi-Muscat N, Gonçalves MA, Björklund M, et al. Infodemics and health misinformation: a systematic review of reviews. Bull World Health Organ [Internet]. 2022;100(9):544 Available from: https://www.pmc/articles/PMC9421549/. Cited 26 May 2023.

  16. Wonodi C, Obi-Jeff C, Adewumi F, Keluo-Udeke SC, Gur-Arie R, Krubiner C, et al. Conspiracy theories and misinformation about COVID-19 in Nigeria: Implications for vaccine demand generation communications. Vaccine. 2022;40(13):2114–21.

  17. Hayawi K, Shahriar S, Serhani MA, Taleb I, Mathew SS. ANTi-Vax: a novel Twitter dataset for COVID-19 vaccine misinformation detection. Public Health [Internet]. 2022;203:23–30. Available from: https://www.pmc/articles/PMC8648668/.

  18. Hayawi K, Shahriar S, Serhani MA, Taleb I, Mathew SS. GitHub - SakibShahriar95/ANTiVax: A novel dataset containing over 15 Million COVID-19 vaccine-related tweets and 15 Thousand labeled tweet for vaccine misinformation detection [Internet]. 2022:23–30. Available from: https://github.com/SakibShahriar95/ANTiVax. Cited 13 Mar 2022.

  19. CDC. Myths and Facts about COVID-19 Vaccines | CDC [Internet]. 2021. Available from: https://www.cdc.gov/coronavirus/2019-ncov/vaccines/facts.html. Cited 13 Mar 2022.

  20. CDC. Myths and Facts about COVID-19 Vaccines for Children | CDC [Internet]. 2021. Available from: https://www.cdc.gov/coronavirus/2019-ncov/vaccines/children-facts.html. Cited 13 Mar 2022.

  21. UNICEF. Vaccine Misinfo Guide [Internet]. 2020. Available from: https://vaccinemisinformation.guide/. Cited 13 Mar 2022.

  22. Alam F, Dalvi F, Shaar S, Durrani N, Mubarak H, Nikolov A, et al. Fighting the COVID-19 Infodemic in Social Media: A Holistic Perspective and a Call to Arms. In: ojs.aaai.org [Internet]. 2021:913–22. Available from: https://sci-hub.ru/https://ojs.aaai.org/index.php/ICWSM/article/download/18114/17917/21609. Cited 13 Mar 2022.

  23. Alam F, Shaar S, Dalvi F, Sajjad H, Nikolov A, Mubarak H, et al. Fighting the COVID-19 Infodemic: Modeling the Perspective of Journalists, Fact-Checkers, Social Media Platforms, Policy Makers, and the Society. In: arxiv.org [Internet]. 202:611–49. Available from: https://sci-hub.ru/https://arxiv.org/abs/2005.00033. Cited 13 Mar 2022.

  24. Twitter Help Center. COVID-19 misleading information policy [Internet]. Twitter. 2021. Available from: https://help.twitter.com/en/rules-and-policies/medical-misinformation-policy. Cited 13 Mar 2022.

  25. A Timeline of COVID-19 Vaccine Developments in 2021 [Internet]. AJMC. Available from: https://www.ajmc.com/view/a-timeline-of-covid-19-vaccine-developments-in-2021. Cited 28 Mar 2022.

  26. Huettman E. Covid Vaccine Hesitancy Drops Among All Americans, New Survey Shows | Kaiser Health News. Kaiser Health News [Internet]. 2021;1–5. Available from: https://khn.org/news/article/covid-vaccine-hesitancy-drops-among-americans-new-kff-survey-shows/. Cited 11 Apr 2022.

  27. Hopkins J, Córdoba J de. Pfizer Identifies Fake Covid-19 Shots Abroad as Criminals Exploit Vaccine Demand. The Wall Street Journal [Internet]. 2021. Available from: https://www.wsj.com/articles/pfizer-identifies-fake-covid-19-shots-abroad-as-criminals-exploit-vaccine-demand-11619006403. Cited 11 Apr 2022.

  28. Mathieu E, Ritchie H, Ortiz-Ospina E, Roser M, Hasell J, Appel C, et al. A global database of COVID-19 vaccinations. Nat Hum Behav. 2021;5(7):947–53.

Acknowledgements

The authors would like to thank their colleagues in the College of Health Sciences for verifying the medical information related to vaccine misinformation.

Funding

This research project was funded by the Office of Research & Sponsored Programs, Abu Dhabi University.

Author information

Contributions

H.I. conceived the idea, designed the study, proposed the classification model, contributed to the spatiotemporal analysis and manuscript writing, and supervised the overall work. N.H. contributed to the classification model design and implementation, the spatiotemporal statistical analysis, and manuscript writing. R.E. contributed to the spatiotemporal statistical analysis and data curation. S.A. contributed to data curation. M.E. contributed to manuscript writing and overall supervision.

Corresponding author

Correspondence to Heba Ismail.

Ethics declarations

Ethics approval and consent to participate

The paper does not use any human or animal subjects as part of the conducted experiments. Experiments were conducted on Twitter chatter data, and the published dataset is anonymized in compliance with the Twitter privacy policy. Only tweet IDs are published.

Consent for publication

The experiments do not involve any individual details or personal identifiers. Experiments were conducted on Twitter chatter data, and the published dataset is anonymized in compliance with the Twitter privacy policy.

Competing interests

All authors confirm that there are no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Ismail, H., Hussein, N., Elabyad, R. et al. Aspect-based classification of vaccine misinformation: a spatiotemporal analysis using Twitter chatter. BMC Public Health 23, 1193 (2023). https://doi.org/10.1186/s12889-023-16067-y

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/s12889-023-16067-y

Keywords