Recruiting participants
Recruitment primarily targeted public health experts, medical professionals, epidemiologists, modelers, risk assessment experts, vector control officials, microbiologists, individuals with on-the-ground understanding of conditions surrounding disease outbreaks, public health graduate students, and others interested in infectious disease outbreaks; however, forecasting was open to any interested participant. The research team coordinated with ProMED-mail, an infectious disease reporting newsletter that reaches over 80,000 subscribers in at least 185 countries, as well as other infectious disease newsletters, professional networks, and public health groups [12]. Skilled prediction traders recruited and vetted by Hypermind over several years of participation in its geopolitical and business prediction market were also invited to join the project on a voluntary basis [13, 14]. Thirty-one percent (31%) of participants were recruited during the initial recruitment effort in January 2019, and another 51% joined during a second recruitment drive in July 2019. Additional participants were allowed to join at any time over the 15-month course of the project. Although differences in starting date may have limited comparisons between participants, allowing participants to join at any time expanded opportunities to attract active participants and garner the most accurate forecasts, one of the primary goals of this project.

Prizes were awarded in three rounds, covering two six-month periods (January–June 2019; July–December 2019) and one three-month period (January–March 2020). Awards were based on forecasting performance (see "Scoring participants" below). For the first and second rounds, the first-place prize was $599, with descending amounts down to a fifth-place prize of $100, and a performance-based raffle awarded 20 additional participants $50 each. The shorter third round used a similar prize structure with lower amounts, starting at $500 for first place and including only 12 raffle winners.
Developing the platform
The research team explored a number of potential approaches to an online disease prediction platform and chose Hypermind (previously known as Lumenogic), a company with extensive experience in crowdsourcing techniques, to assist in this process [15]. After evaluating both prediction markets and prediction polls, the research team considered prediction polling through Hypermind's Prescience platform to be the most easily accessible to those without experience in commodities trading, an important factor in attracting and retaining participants. Hypermind's Prescience platform was developed through experience with several Intelligence Advanced Research Projects Activity (IARPA) research programs on crowd-based forecasting for the US intelligence community [16–18]. As Fig. 1 illustrates, the platform allowed participants to forecast easily and quickly by assigning probabilities to possible outcomes. They could update their forecasts as needed, share the reasoning behind their forecasts, engage in conversations with other forecasters, access open-source information about disease topics, and compete for performance-based prizes. Importantly, forecasters were shown the current aggregated crowd forecast for each question, as well as its evolution over time, but only after making their first forecast on that question. The platform was also lightly customized for the particular needs of this project, including a bespoke dashboard aimed at policy makers.
Developing forecasting questions
The research team developed an initial set of questions for the platform and added new questions, or "Individual Forecasting Problems" (IFPs), at a rate of approximately one per week (see Supplementary material). New IFPs were generally added on Mondays, in conjunction with a weekly newsletter, to encourage continued interest and participation in the project. IFPs focused on a range of infectious disease outcomes, including intensity of disease (e.g., number of US states with high influenza-like illness activity), spread of disease to different locations, and case counts. When developing IFPs, care was taken to ensure that the wording had only one possible interpretation, that forecasters would be able to select a discrete outcome from a complete set of mutually exclusive answers weeks or months ahead of its occurrence, and that the IFP could be fairly resolved by a pre-identified, authoritative source that provided timely information (i.e., if an IFP asked for a case count by a certain date, the resolution source needed to provide a reliable report on that date). The platform supported two types of IFPs: "discrete" IFPs featured two or more discrete answers (e.g., yes/no, or Beni/Butembo/Katwa/Mandima), while "range" IFPs featured three or more interval answers arranged on a continuum (e.g., 20 or fewer cases/21–100/101–300/more than 300 cases). Figure 2 shows an example of a "range" IFP.
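For illustration only, the Python sketch below captures the constraints an IFP had to satisfy as a simple data structure; the field names and example values are ours, not the platform's actual schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import List

@dataclass
class IFP:
    """Minimal sketch of an Individual Forecasting Problem as described above.
    Field names are illustrative, not the platform's actual schema."""
    question: str            # unambiguous wording with a single interpretation
    answers: List[str]       # complete set of mutually exclusive outcomes
    kind: str                # "discrete" or "range" (ordered intervals)
    resolution_source: str   # pre-identified, authoritative, timely source
    resolution_date: date    # date by which the source must allow resolution

# Hypothetical "range" IFP with ordered case-count intervals.
example_ifp = IFP(
    question="How many cases will be reported by the resolution date?",
    answers=["20 or fewer", "21-100", "101-300", "more than 300"],
    kind="range",
    resolution_source="official situation report",
    resolution_date=date(2019, 6, 30),
)
```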
Scoring participants
The forecasting performance of each participant was measured relative to that of other participants, for both timeliness and accuracy. Probability forecasts were scored using the Brier score [19] for discrete IFPs and its distance-sensitive, ordered-categorical version [20] for range IFPs. Every day, the platform recorded each participant's latest forecast for each IFP; if a participant had not made a forecast that day, their forecast from the previous day was carried over. When an IFP resolved according to ground truth, the score of each daily forecast was computed and compared to the median score of all other participants for that IFP on that day. Forecasts that were more accurate than the median led to point gains, while forecasts that were less accurate than the median caused forecasters to lose points; a participant whose score matched the median on a particular day scored 0 points for that day. On days before a participant had started forecasting an IFP, their daily score was imputed to be the median score obtained by all active forecasters, for better or worse. As soon as participants believed they could make a forecast better than most, they therefore had an incentive to do so.
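As an illustration of this scoring scheme, the Python sketch below computes the Brier score for a discrete IFP, one common construction of its distance-sensitive ordered-categorical version (averaging binary Brier scores over the cumulative splits of the ordered bins), and a relative daily point total. The cumulative-split construction and the linear point scaling are assumptions on our part; the exact definitions are those of [19, 20] and the platform.

```python
import numpy as np

def brier_score(probs, outcome_index):
    """Brier score for a discrete IFP: sum of squared differences between the
    forecast probabilities and the realized outcome vector (0 = perfect, 2 = worst)."""
    outcome = np.zeros(len(probs))
    outcome[outcome_index] = 1.0
    return float(np.sum((np.asarray(probs, dtype=float) - outcome) ** 2))

def ordered_brier_score(probs, outcome_index):
    """Distance-sensitive score for a 'range' IFP with ordered bins: average the
    binary Brier scores over the K-1 cumulative splits of the categories, so that
    probability placed near the correct bin is penalized less than probability
    placed far away. (One common construction; assumed, not quoted from [20].)"""
    probs = np.asarray(probs, dtype=float)
    k = len(probs)
    scores = []
    for split in range(1, k):
        p_low = probs[:split].sum()                 # forecast mass at or below the split
        o_low = 1.0 if outcome_index < split else 0.0
        scores.append((p_low - o_low) ** 2 + ((1 - p_low) - (1 - o_low)) ** 2)
    return float(np.mean(scores))

def daily_points(participant_score, all_scores):
    """Relative daily points: positive if more accurate (lower score) than the median
    of all participants' daily scores for the IFP, negative if less accurate, zero at
    the median. The linear scaling is illustrative; the paper does not specify the
    exact point formula."""
    return float(np.median(all_scores) - participant_score)
```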
Aggregating the crowd forecast
Individual forecasts for each question were aggregated using an algorithm developed and tested through experience with several IARPA research programs in geopolitical crowd forecasting [11]. Individual forecasts were first weighted, with greater weight given to forecasters who updated their predictions frequently and who had a better past record of accuracy. The pool of individual forecasts was then reduced so that only the 30% most recent forecasts were retained for aggregation, while the rest were discarded. The weighted forecasts were then averaged. Finally, an extremization step was applied to sharpen the resulting forecast and compensate for collective under-confidence [21]. As previously noted, individual forecasters had access to a crowd forecast while making their own, but that publicly displayed crowd forecast reflected a degraded version of the full algorithm just described: it was the simple average of the 30% most recent forecasts for that IFP, taking into account neither individual weights nor extremization. We wanted forecasters to position themselves relative to the crowd's opinion without giving them the benefit of the fully optimized crowd wisdom.
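A minimal sketch of this aggregation pipeline is shown below, assuming per-forecaster weights reflecting update frequency and past accuracy have already been computed. The specific weighting scheme, the extremization exponent, and the exact recency rule are our assumptions, not the algorithm of [11].

```python
import numpy as np

def aggregate_crowd(forecasts, timestamps, weights, alpha=2.0, recency_frac=0.3):
    """Illustrative aggregation in the spirit of the algorithm described above.
    `forecasts`: (n_forecasters, n_outcomes) array of probability vectors (each
    forecaster's latest forecast); `timestamps`: time of each latest forecast;
    `weights`: per-forecaster weights. `alpha` and `recency_frac` are assumptions."""
    forecasts = np.asarray(forecasts, dtype=float)
    timestamps = np.asarray(timestamps, dtype=float)
    weights = np.asarray(weights, dtype=float)

    # 1) Retain only the most recent fraction of forecasts (here, the latest 30%).
    n_keep = max(1, int(np.ceil(recency_frac * len(forecasts))))
    recent = np.argsort(timestamps)[-n_keep:]

    # 2) Weighted average of the retained probability vectors.
    w = weights[recent]
    avg = (w[:, None] * forecasts[recent]).sum(axis=0) / w.sum()

    # 3) Extremize to counteract collective under-confidence, then renormalize.
    sharpened = avg ** alpha
    return sharpened / sharpened.sum()
```

The publicly displayed crowd forecast described above would correspond to step 1 followed by a simple (unweighted) average, with no extremization.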
Evaluating the crowd forecast
The crowd forecast's absolute accuracy for each IFP was computed by averaging its daily Brier scores over the lifetime of the IFP. The overall accuracy of the aggregated forecasts was also computed as the average of its scores across all IFPs. Forecasting accuracy is only meaningful, however, when compared to benchmarks, such as the "chance" forecast that would result from assigning equal probabilities to all possible outcomes of an IFP, or the accuracy of the individual forecasters themselves. The accuracy and timeliness of the crowd forecast were further evaluated in four increasingly stringent ways: 1) the percentage of the lifetime of an IFP during which the crowd forecast was more accurate than chance, 2) the point in the lifetime of an IFP at which the crowd forecast became irreversibly better than chance (the earlier the better), 3) the percentage of the lifetime of an IFP during which the correct outcome was the crowd's favorite, and 4) the point in the lifetime of an IFP at which the correct outcome became irreversibly the crowd's favorite. For example, in the IFP described in Fig. 1, the crowd's forecast was more accurate than the chance forecast for 16 days out of 21, or 76% of the lifetime of that IFP, and it became irreversibly better than the chance forecast on day 5, or 24% into the IFP's lifetime. The crowd favored the correct outcome for 10 out of 21 days, or 48% of the lifetime of that IFP, and the correct outcome became irreversibly the crowd's favorite on day 12, or 57% into its lifetime.
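These four metrics can be computed from the matrix of daily aggregated forecasts as in the sketch below; the tie-handling convention (ties counted as not better than chance and not the favorite) is our assumption. Applied to the example just described, such a routine would return a fraction of roughly 0.76 for days better than chance and day 5 for the irreversible crossover.

```python
import numpy as np

def evaluate_crowd(daily_probs, outcome_index):
    """Compute the four lifetime metrics described above from daily aggregated
    forecasts (`daily_probs`, shape: days x outcomes) and the index of the outcome
    that eventually occurred. Percentages are returned as fractions of the IFP's
    lifetime; days are 1-indexed."""
    daily_probs = np.asarray(daily_probs, dtype=float)
    n_days, n_outcomes = daily_probs.shape
    outcome = np.zeros(n_outcomes)
    outcome[outcome_index] = 1.0

    # Daily Brier scores of the crowd and of the uniform "chance" forecast.
    crowd_brier = ((daily_probs - outcome) ** 2).sum(axis=1)
    chance_brier = ((np.full(n_outcomes, 1.0 / n_outcomes) - outcome) ** 2).sum()

    beats_chance = crowd_brier < chance_brier
    is_favorite = daily_probs.argmax(axis=1) == outcome_index

    def irreversible_day(flags):
        """First day after which the condition holds on every remaining day;
        None if it never becomes irreversible."""
        for day in range(len(flags)):
            if flags[day:].all():
                return day + 1
        return None

    return {
        "fraction_better_than_chance": float(beats_chance.mean()),
        "day_irreversibly_better_than_chance": irreversible_day(beats_chance),
        "fraction_correct_is_favorite": float(is_favorite.mean()),
        "day_correct_irreversibly_favorite": irreversible_day(is_favorite),
    }
```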