Superforecasters in the Cosmic Bazaar

When predicting the future becomes a government mandate

Feb 14, 2023

Many political and economic forecasters use the early part of the year to make predictions about what the next 12 months will bring. Few of them remind their readers about the predictions they made the previous year, making it hard to assess whether they are worth reading. Across the UK government there are people who, unlike media talking heads and columnists, are constantly evaluating the accuracy of their predictions. A group of civil servants, intelligence professionals, diplomats, and academics, of varying degrees of seniority, have woken up, logged onto a website, and anonymously offered their best forecasts in answer to questions on geopolitical flashpoints. The site encourages debate and interaction. Users can “upvote” comments by others and there are seminars on the topics the questions address. The tournament is exotically named “Cosmic Bazaar,” and it was created to improve the UK’s intelligence analysis, especially long-range forecasting on events that are pushed down the agenda by whatever the current crisis happens to be.

In 1814, mathematician Pierre-Simon Laplace claimed that if there was an intellect vast enough to comprehend the position of every atom in the universe and able to analyze all the forces of nature set at motion at one point in time, it could embrace in a single formula the movements of the greatest bodies of the universe and those of the tiniest atom. For such an intellect nothing would be uncertain. The future would be present before its eyes. This intellect has been called Laplace’s demon. Well, we don’t have a demon, but we now have a Cosmic Bazaar.

Since 2020, more than 10,000 forecasts have been made by 1,300 forecasters from 41 government departments and several allied countries. The site has around 200 regular forecasters, who use only publicly available information to answer the 30-40 questions that are live at any time. Users are ranked by the accuracy of their predictions: high confidence in a correct prediction scores higher than low confidence in the same prediction, and high confidence in an inaccurate forecast scores lower than low confidence in an inaccurate forecast. The tournament is inspired by similar competitions in the United States.
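The article describes the scoring only in words and does not say which rule Cosmic Bazaar uses. One common rule with exactly this property is the Brier score, sketched below as an assumption rather than as the site's actual method. The function name and the four example forecasts are hypothetical.

```python
def brier_score(forecast_probability: float, event_happened: bool) -> float:
    """Squared gap between the stated probability and the outcome (1 or 0).

    Lower is better: 0.0 is a perfect forecast, 1.0 is the worst possible.
    """
    outcome = 1.0 if event_happened else 0.0
    return (forecast_probability - outcome) ** 2


# Four hypothetical forecasters answer the same yes/no question,
# and the event then happens.
confident_and_right = brier_score(0.9, True)   # said 90 percent
cautious_and_right = brier_score(0.6, True)    # said 60 percent
cautious_and_wrong = brier_score(0.4, True)    # said 40 percent
confident_and_wrong = brier_score(0.1, True)   # said 10 percent

for label, score in [
    ("confident and right", confident_and_right),
    ("cautious and right", cautious_and_right),
    ("cautious and wrong", cautious_and_wrong),
    ("confident and wrong", confident_and_wrong),
]:
    print(f"{label}: {score:.2f}")
```

Because the penalty grows with the square of the miss, confidence is rewarded when it is justified and punished hardest when it is not, which matches the ordering described above.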

In 2006, after the failure to find weapons of mass destruction (WMDs) in Iraq, the Intelligence Advanced Research Projects Activity (IARPA) was created. Its mission is to fund cutting edge research with the potential to make the intelligence community smarter and more effective. In 2010, IARPA decided to sponsor a tournament to see who could invent the best methods of making the sort of forecasts that intelligence analysts make every day. These forecasts would attach degrees of probability and certainty to future events. In forecasting it is better to think of many potential futures rather than The Future. Laplace popularized and built on the work of another 18th-century mathematician, Thomas Bayes, best known for the theorem named after him, which describes the probability of an event, based on prior knowledge of conditions that might be related to the event. Laplace was responsible more than any other for developing the Bayesian interpretation of probability. Where our understanding is less than perfect, where we don’t have the complete knowledge of Laplace’s demon, we use probability to forecast future events.
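For readers who want the theorem itself, it can be written in its standard form as follows. The symbols are a conventional choice rather than the article's own: H stands for a hypothesis about the future and E for a new piece of evidence.

```latex
P(H \mid E) = \frac{P(E \mid H)\, P(H)}{P(E)}
```

Here P(H) is the prior probability held before the evidence arrives, P(E | H) is how likely that evidence would be if the hypothesis were true, and P(H | E) is the updated probability after taking the evidence into account.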

While designing the tournament, IARPA officials met Dr. Philip Tetlock of the University of Pennsylvania. Tetlock had been running forecasting tournaments for over 20 years. Over this period, he used a group of almost 300 experts from a variety of different fields to make over 28,000 predictions. His findings suggested that most of these experts were only slightly better at forecasting than leaving the answers to chance. Forecasters with the biggest media presence had the worst accuracy (he suggests that the need for a story means those with outlier or controversial views are often favored over those with track records of accuracy). Tetlock did, however, identify some highly accurate forecasters, who were smart, well-versed in current affairs, would regularly revise their predictions as new information came to light, and were conscious of and tried to avoid typical human biases. He called these “superforecasters.”

What Tetlock discovered, though, was that the aggregate forecasts were more accurate over time than any single individual. This utilizes a concept identified by the scientist Francis Galton in 1907, now known as “the wisdom of crowds.” Galton observed a contest at an English country fair to estimate the weight of an ox. The average (mean) guess of nearly 800 people was just one pound short of the actual weight (1,198 pounds). Many of these people had bits of information that would help point to the right answer. One might have remembered the weight of a similar ox from last year’s fair, another might have been a butcher used to working with meat, another a restaurant owner who buys large quantities of ox. All the valid information pointed in one direction, toward the right answer, while all the errors pointed in different directions. Tetlock augmented the wisdom of crowds with research into whether and how people make good judgments and common thinking errors.

The IARPA tournament became the Aggregative Contingent Estimation (ACE) programme, run from 2010 to 2015. Groups competed to beat the combined forecast success rate. Tetlock entered a team into the competition and called it the Good Judgment Project. He selected a group of “superforecasters” from his previous tournaments, grouping them together as a team to produce a team aggregate, and then supplemented them with an algorithm that made two tweaks to their aggregated predictions.

First, he tracked the performance of the top forecasters within the group of “superforecasters” and then used an algorithm to weight their predictions more heavily than the rest. Second, he extremized confidence. No individual forecaster sees all the information, so each one’s confidence is lower than that of someone who has seen everything. Tetlock could not give all his researchers all of the information, but collecting all the forecasts from the group to get the wisdom of the crowd also collects the information dispersed across the group. So Tetlock used the algorithm to push the aggregate confidence toward the extremes. An average of 80 percent confidence was boosted to 95 percent and 30 percent was downgraded to 15 percent. Tetlock’s research team won the contest and were found to be at least one-third more accurate than other research teams.
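The two tweaks can be sketched in a few lines. The article does not give the formulas the Good Judgment Project used, so both functions below are assumptions: a plain weighted average, and one commonly used extremizing transform. The exponent of 2 was picked only because it lands close to the figures quoted above. The forecasts and weights are hypothetical.

```python
def weighted_average(forecasts: list[float], weights: list[float]) -> float:
    """Average the forecasts, counting better-performing forecasters more."""
    return sum(p * w for p, w in zip(forecasts, weights)) / sum(weights)


def extremize(probability: float, exponent: float = 2.0) -> float:
    """Push a probability away from 0.5 and toward 0 or 1.

    An exponent of 1 leaves it unchanged; larger exponents push harder.
    This transform is an assumed stand-in, not the tournament's formula.
    """
    yes = probability ** exponent
    no = (1.0 - probability) ** exponent
    return yes / (yes + no)


# Three hypothetical forecasters; the third has the best track record,
# so the third forecast counts double.
crowd = weighted_average([0.75, 0.85, 0.80], [1.0, 1.0, 2.0])

print(f"weighted crowd forecast: {crowd:.2f}")
print(f"after extremizing:       {extremize(crowd):.2f}")
print(f"30 percent extremized:   {extremize(0.30):.2f}")
```

With this exponent, 80 percent becomes roughly 94 percent and 30 percent becomes roughly 16 percent, close to the 95 and 15 percent mentioned above. A forecast of exactly 50 percent is left where it is, since there is no direction to push it in.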

The U.S. intel community went on to carry out more than a dozen forecasting projects, including prediction markets, in which people can bet money or points on the outcome (betting on elections is often a better predictor of results than polling). Although there are now no active forecasting tournaments among U.S. intelligence agencies (Tetlock suggests this is because they exposed the inaccuracy of seasoned intel professionals), these projects inspired Cosmic Bazaar.

The results of Cosmic Bazaar are not available to the public, and it is still probably too early to assess its impact. However, some of the lessons on good judgement from Tetlock’s work in forecasting can be useful for anyone interested in better understanding what might come next.

First, Tetlock highlights many of the psychological biases that were identified and named by psychologists Daniel Kahneman and Amos Tversky in the 1970s. As a former intelligence officer, I saw many examples of how these biases can hinder assessments. When I served in Afghanistan, my priority was gathering intelligence that could enable us to determine who was planting IEDs that targeted coalition troops, and where they were planting them. This task was made more difficult because our knowledge of both battlefield and enemy was fragmentary and incomplete. We operated in one province, Helmand. Our movements were limited by the rugged terrain and by the presence of enemy insurgents. We were further limited by how little we understood the language and the culture. This made us vulnerable to the fallacy of composition, illustrated by the group of blind men in the ancient Indian parable who encounter an elephant for the first time and are asked to describe it: one of them feels only the elephant’s tail, another only its trunk, and so on, and they end up confidently providing very different descriptions of what an elephant looks like.

Most people get their information from only a few sources—and rarely from sources that present conflicting perspectives or originate in countries other than their own. Therefore, most people, like the blind men feeling the elephant, only get a small part of the picture, yet when they have information about one part of an issue, they think they understand the whole issue. In everyday life, people tend to succumb to the fallacy of composition when they make generalizations about people who belong to groups to which they have very little exposure. This is an especially easy error to make because the media, knowing that extreme views are more likely to capture audience attention, tends to platform the most extreme group members.

We also had to fight the tendency to assume that tomorrow will be much like today, which made it difficult for us to anticipate key changes in our enemy’s strategy even when we had clues that such a change might be imminent. It was tempting to dismiss such clues because they didn’t fit what we knew about the current situation. Pointing out that things are about to change risks offending those who developed previous assessments. Agreeing with the prevailing opinion is a safer option for any single intelligence officer. And yet circumstances change more often than they stay the same. Tetlock highlights superforecasters’ ability to constantly re-evaluate their forecast as new information appears.

Among the many other biases Kahneman and Tversky describe, one of the most common is confirmation bias: we search for information—and recall it—in a way that confirms our existing beliefs. Internet search engines make this tendency worse because their algorithms direct us to information like the kind we’ve already viewed. In today’s polarized political climate another key bias is the affect heuristic: people tend to let their likes and dislikes determine their beliefs. Our political preferences tend to determine which arguments we find compelling. We tend to assume that our own limited, subjective experience points to absolute truths, while dismissing the idea that other people’s different limited, subjective experiences may provide a key piece of the puzzle.

This psychological bias often works in tandem with the temptation to indulge in magical thinking: imagining that things will turn out as we want even though we can’t explain how. This tendency was noted by David Omand, the former head of the UK Government Communications Headquarters (GCHQ), as one of the most common reasons for failing to accurately predict developments.

As well as avoiding biases, there are positive steps to take. To start with, Tetlock advises us to make sure we focus on questions we can answer, breaking seemingly intractable problems into tractable sub-problems. A big and elusive question should be broken down into several smaller and more tractable ones, known as “Bayesian question clusters.” In my interview to become an intelligence officer I was asked how many ties were sold in the UK in the previous year. The idea was to break down the big question into a series of smaller questions that would get me to an estimate of how many people would buy ties out of the total UK population and then how often they would buy them. We should also resist the urge to switch a hard question for an easy one. For example, when reviewing the intelligence assessments of the likelihood that Saddam Hussein had WMDs in the run-up to the invasion of Iraq, we should not swap the question “Was it a good decision?” for “Did it have a good outcome?” Tetlock highlights that the conclusion that Saddam had WMDs was a fair assessment given what we knew at the time. The mistake lay in the level of certainty attached to it. A lower level of certainty may have meant the threshold for going to war was not met.

When you have the right questions, you then need the right data. If you want to get a more accurate picture from what you read or listen to, it makes sense to ask yourself the three questions we asked when we evaluated intelligence sources in Helmand: (1) What is the writer’s/speaker’s motivation?, (2) How good is their access to this information?, and (3) How much expertise do they have on the topic? Tetlock then highlights striking the right balance between inside and outside views. You should look at the particulars of any situation, but also ask, How often do things like this happen in situations of this sort? This sets a baseline that you then move up or down from—due, of course, to the particulars. For example, on the question of whether President Putin will use nuclear weapons in Ukraine, the starting point for any forecast should be that Russia have never used them in the past and that, in fact, no-one has since 1945. Then move up from this low base level, rather than jumping to a high probability of their use, due to anchoring onto the escalating use of artillery and the pressure on Putin due to repeated defeats. When you have made your forecast, constantly update it and do so incrementally.
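The habit of starting from a base rate and then updating in small steps can be sketched as a Bayesian update in odds form. Every number below is a hypothetical placeholder chosen for illustration, not an estimate for any real question, and the function name is likewise an assumption. A likelihood ratio above 1 means a piece of evidence is more expected if the event is coming; a ratio below 1 means it is less expected.

```python
def update(probability: float, likelihood_ratio: float) -> float:
    """Bayesian update in odds form: new odds = old odds * likelihood ratio."""
    prior_odds = probability / (1.0 - probability)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


# Outside view: a hypothetical base rate for events of this sort.
base_rate = 0.02

# Inside view: hypothetical likelihood ratios for three pieces of news,
# applied one at a time as each arrives.
evidence = [1.5, 0.8, 2.0]

forecast = base_rate
history = [forecast]
for ratio in evidence:
    forecast = update(forecast, ratio)
    history.append(forecast)

print(" -> ".join(f"{p:.3f}" for p in history))
```

In this made-up run the forecast moves from 2 percent to a little under 5 percent. The evidence shifts the estimate, but because the starting point is a low base rate, it takes much stronger evidence to reach a high probability.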

Even when following the above advice, forecasters will still sometimes get it wrong. This is especially true when trying to predict something as complex as the future geopolitical or economic environment. The war zones I worked in, while not closed systems, nevertheless had fewer variables than the world at large. As we have become increasingly interconnected and interdependent, and new technologies have evolved speeding up the passage of information and decision making, the systems we are analyzing are becoming increasingly complex and unpredictable.

In complex systems, tiny differences in initial conditions can lead in unpredictable ways to large differences in outcomes over the long term: this is popularly known as the butterfly effect (shorthand for the idea that a butterfly flapping its wings in Peru could kickstart a series of events that ends up causing a tsunami in Japan). The effect was identified in 1961 by the mathematician and meteorologist Edward Lorenz. He was using a computer simulation to explore how weather patterns developed. He decided that he wanted to see a particular sequence of data again. To save time, he started that sequence from a point in the middle of its run, by entering the data that the first run had generated at that point. To his surprise, re-running the simulation this way produced a radically different outcome. Lorenz realized that this was because the computer program was set to round numbers off to six decimal places, but the printout of the results that he used to restart the sequence rounded numbers off to three decimal places. Lorenz had not expected this tiny difference to significantly alter the outcome. As he put it, “the approximate present does not approximately determine the future.”
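The rounding effect is easy to reproduce on a much simpler system than a weather model. The sketch below swaps in the logistic map, a standard textbook example of chaos, in place of Lorenz's actual simulation, which is not reproduced here. The starting value is an illustrative six-decimal number, and the function name is hypothetical.

```python
def logistic_map_run(start: float, steps: int, growth: float = 4.0) -> list[float]:
    """Iterate x -> growth * x * (1 - x), a simple chaotic system at growth=4."""
    values = [start]
    for _ in range(steps):
        current = values[-1]
        values.append(growth * current * (1.0 - current))
    return values


start = 0.506127                      # illustrative six-decimal state
restart = round(start, 3)             # the same state as a three-decimal printout

full_precision = logistic_map_run(start, 30)
rounded_restart = logistic_map_run(restart, 30)

# How far apart the two runs are at each step.
gaps = [abs(a - b) for a, b in zip(full_precision, rounded_restart)]

for step in (0, 5, 10, 15, 20):
    print(f"step {step:2d}: gap = {gaps[step]:.6f}")
```

The two runs begin barely a ten-thousandth apart, yet the gap roughly doubles with each step until the trajectories bear no resemblance to each other. Both runs are fully deterministic; the only difference is the three decimal places dropped at the start.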

Lorenz’s discovery is a key component of what came to be called chaos theory, which posits that, although the outcomes of complex systems can seem random, they are deterministic, and contain underlying patterns and feedback loops. Detecting these can help us forecast more accurately. How far in the future we can feel confident making predictions depends on how accurately we can measure a system’s current state, how rapidly the system’s dynamics tend to change, and how much uncertainty about the forecast we are willing to tolerate. Just as meteorologists can only forecast that there is a certain percentage chance of rain, intelligence professionals can only forecast probabilities—even though their governments (and the public) want certainties. Tetlock identified the impact of what he terms “butterfly dynamics and nonlinear systems” on the limits of predictability, noting that predictions declined towards chance at five years out. With the increasing complexity of global systems and rapid advance of technology that decline is getting steeper and the horizon closer.

Communications technology has radically increased the speed at which the global geopolitical and economic systems can change. For example, governments often make foreign policy announcements on social media, and the billions of people who use social media have instant access to that new information, which potentially changes their behavior. The 24-hour news cycle spins at such a dizzying speed that, by the time we analyze the potential impact of one event, our predictions have become outdated: stories about that event disappear, and the next hyperbole-laden headline is thrown into view. And that’s just the tip of the iceberg. Many of the variables that influence human affairs are machine-learning algorithms so complex that they have become black-box systems. Add to that our polarized political environment, in which people have trouble even agreeing on which facts are real, and clearly today we all live in Lorenz’s “approximate present.”

There is the hope that forecasting can be improved by crunching big data with increasingly powerful computers. Just as our ancestors searched for portents in the sky, we search for answers in the digital clouds. But when we throw large amounts of data into a computer, spurious connections and correlations will emerge. Machine learning can be useful in setting the baseline from the outside view, but weaker at understanding the particulars of the inside view, including the human psychology of key actors and emerging variables not present in previous situations. To forecast with any degree of confidence, we still need to provide the algorithm with the right parameters, understand the larger context, and ask the right questions. We cannot simply rely on artificial intelligence to identify the signal amid the noise, because it lacks our fuller understanding of the wider international context and the complexity of human behavior. Technology has a significant role to play, but any algorithm is only as good as the data put into it and the assumptions built into its structure (the questions that are posed in Cosmic Bazaar, for example). Assumptions underpinning the algorithms often reflect the expectations of their coders in much the same way as other methods of prediction.

I was lucky that, when I was providing intelligence to the soldiers in Helmand, they wanted the information. In today’s more politically polarized environment, when new facts emerge that are likely to influence future events, it’s becoming harder to convince policymakers and the public to use that information to change course.

We are in an era where the will of the people is preferred to the opinion of experts. This is partly due to the deliberate tactics of populist governments: they benefit by discrediting experts who question their policies. It is also partly due to many high-profile failures of experts, such as the lack of WMD in Iraq. The rigorous scoring and long-term tracking of accuracy in tournaments like Cosmic Bazaar can restore some credibility. But it is also partly due to a misunderstanding of how intelligence works. Most people understand that weather forecasts are helpful even though they are probabilistic and therefore sometimes inaccurate, and they understand that it’s easier to predict tomorrow’s weather than next week’s. But many people fail to understand that intelligence forecasts are helpful for the same reason, even though they too are probabilistic (which means that low-likelihood events are bound to happen sometimes) and even though they too are better at predicting events in the short term than in the long term.

Forecasting helps us briefly lift our eyes from the chaotic present to survey the far horizon, plan for possible futures, and navigate the flood of information in which we’re all drowning. Today, when old certainties are fracturing, new technologies are proliferating across our multipolar world, and we face new and emerging threats from adversaries we do not well understand, we need good forecasting more than ever.

A guest post by

Andy Owen

Author interested in the history and philosophy of war. Work has appeared in Time, Aeon, The Spectator, Arc Digital, and elsewhere.

1 Comment

Jeffrey Quackenbush

Evening In the Blue Forest

Interesting article. I'd be curious to hear your views on the difference between computational and qualitative approaches to forecasting in our current intellectual climate.

There is a giant mismatch between the two. On one hand, there are a lot of people who excel at creating accurate, highly complex computational models, but who believe that the world itself consists in the same processes of computation and information flow, such that qualitative considerations are epiphenomenal. On the other hand, it can be argued that the world is full of qualities, and these should be addressed on their own terms, except that we haven't learned to handle them with the same rigor.
