As most readers of this blog are likely aware, two of the most widely used measures of corruption perceptions—Transparency International’s Corruption Perceptions Index (CPI) and the Worldwide Governance Indicators (WGI) corruption index—are composite indicators that combine perceived corruption ratings from a range of different sources (including private rating agencies, NGOs, international development banks, and surveys of firms and households). The CPI takes a simple average of the available sources for each country. The WGI uses a somewhat fancier “unobserved component model” (UCM), which assumes that each source’s score is a noisy signal of the “true” level of perceived corruption. The UCM differs from a simple average in a few ways, perhaps most notably by giving less weight to “outlier” sources, though in practice the WGI and CPI are highly correlated, and the WGI’s creators report that their results do not change very much if one takes a simple average instead.
These composite indicators have a number of well-known problems, which I won’t bother going into here. Rather, the main purpose of this post is to introduce readers to an alternative index, developed by Samuel Standaert at Ghent University, which he calls the “Bayesian Corruption Index” (BCI). Standaert introduced the BCI in a 2015 article, but so far as I can tell it has not attracted much attention. The BCI certainly doesn’t solve all the problems of the traditional aggregated corruption perceptions indicators (more on this below), but it’s definitely an improvement, and deserves wider use. Let me first say a bit about how the BCI differs from the WGI, why I think it’s an advance over the WGI and CPI, and what some of its limitations are.
First, with respect to how the BCI is constructed: The paper is quite technical, and for those who (like me) are not so comfortable with fancy statistics, a bit challenging, but I think the basic intuition is actually pretty straightforward, so let me try to give a somewhat simplified, non-technical explanation.
Maybe the best way to do this is to start with the WGI. As noted above, the way the WGI is constructed is more or less like this: For country X in year Y, we have perceived corruption scores from a bunch of different sources—let’s say four of them, Z1, Z2, Z3, and Z4. The scores are all different, but we assume those differences are due to measurement error, so we construct a statistical model that assumes that each source’s score is a function of the “true” level of perceived corruption plus some random error (“noise”). (We also assume that the random error is symmetric rather than skewed, and that the random errors from each source are uncorrelated.) If our assumptions hold, then aggregating sources is better than using any single source, because we’re using more information; the random errors will tend to offset, though we can and should use them to calculate the amount of statistical uncertainty—the margins of error—associated with our estimated scores.
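For readers who find it easier to see this in code: here is a minimal toy simulation of the noisy-signal idea, not the WGI’s actual UCM. All of the numbers (the true score, the noise level, four sources) are made-up assumptions for illustration; the point is just that, if errors really are independent, averaging four sources cuts the typical error roughly in half (by a factor of the square root of four).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: one country's "true" perceived-corruption score,
# observed by four sources with independent, symmetric (Gaussian) noise.
true_score = 0.5   # assumed latent value (arbitrary units)
noise_sd = 0.3     # assumed per-source measurement error
n_sources = 4
n_sim = 10_000

sources = true_score + rng.normal(0.0, noise_sd, size=(n_sim, n_sources))

# Typical error of a single source vs. the average of all four:
err_single = np.abs(sources[:, 0] - true_score).mean()
err_avg = np.abs(sources.mean(axis=1) - true_score).mean()

# With independent errors, averaging k sources shrinks the error
# by a factor of sqrt(k) -- here, roughly a factor of 2.
print(err_single, err_avg)
```

This is the best-case scenario; as discussed later in the post, the shrinkage is much weaker if the sources’ errors are correlated.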
But—and here is Standaert’s crucial insight—for each different year, the WGI score for each country is calculated separately, using only the sources for that year, even though we have very strong reasons to believe that the level of perceived corruption is “sticky”: there’s a lot of inertia, and it doesn’t change much from year to year. So when we try to estimate the level of corruption in country X in year Y, and we use only sources from year Y, as opposed to year Y-1, Y-2, etc. (or, for that matter, year Y+1, Y+2, etc.), we’re throwing away information. We’re also implicitly treating year-to-year changes as if they were separate estimates of two entirely different things. Now, a challenge here is that while we have strong reasons to believe that the true level of perceived corruption in one year is correlated with the level in the previous year, we also don’t want to assume that the level of perceived corruption in any given country is constant across all years; after all, one of the questions we might want to ask is whether perceived corruption is increasing or decreasing. Standaert’s solution is to assume that for each country, there’s some constant degree of “stickiness” (that is, year-to-year correlation); this can differ across countries, though not across years, and each country’s “stickiness” parameter is estimated in the statistical model, rather than assumed. The statistical techniques needed to do the calculations are complicated, and frankly over my head (it’s a form of Bayesian inference using Gibbs sampling), but I hope the intuition is clear.
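Again, for the code-inclined, here is a crude toy sketch of why “stickiness” helps—emphatically not Standaert’s actual Gibbs-sampling model. I assume a highly persistent true level (an autoregressive process with a made-up correlation of 0.95) observed with noise each year, and compare estimating one year from that year’s score alone against a naive pooled estimate that also borrows the neighboring years’ scores.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed parameters for illustration: year-to-year correlation (rho),
# observation noise, number of years, and number of simulated countries.
rho, noise_sd, n_years, n_sim = 0.95, 0.5, 30, 2_000

errs_single, errs_pooled = [], []
for _ in range(n_sim):
    # "True" perceived corruption: a sticky (highly persistent) process.
    latent = np.zeros(n_years)
    for t in range(1, n_years):
        latent[t] = rho * latent[t - 1] + rng.normal(0.0, np.sqrt(1 - rho**2))
    # One noisy observed score per year.
    obs = latent + rng.normal(0.0, noise_sd, n_years)

    t = n_years // 2
    est_single = obs[t]                     # use only year t's score
    est_pooled = obs[t - 1 : t + 2].mean()  # also borrow years t-1 and t+1
    errs_single.append(abs(est_single - latent[t]))
    errs_pooled.append(abs(est_pooled - latent[t]))

# Because the true level barely moves between adjacent years, borrowing
# neighboring years' scores reduces the typical estimation error.
print(np.mean(errs_single), np.mean(errs_pooled))
```

The real model does this far more carefully—it estimates the stickiness parameter rather than assuming it, and weights years accordingly—but the payoff is the same in kind: more information per estimate, smaller margins of error.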
So, what difference does this make? What’s the payoff of using this technique over the traditional WGI? A few things: First, because each estimate uses more data, the margins of error are smaller. Second, this technique allows for more extrapolation backwards and forwards in time. Third, and from my point of view most interesting and important, the BCI may be more appropriate for over-time comparisons than the WGI or CPI, because the statistical procedure builds in persistence in corruption levels and allows for the possibility that year-to-year jumps or dips in a country’s scores on various sources may be due to random error rather than actual changes. On this point, it’s worth noting—as Standaert’s paper points out—that the correlations between the WGI, CPI, and BCI when comparing across countries are extremely high (above 0.96 for all three pairwise comparisons), but the correlations between these indexes when comparing within countries are very low (around 0.35 for WGI-BCI and WGI-CPI, and around 0.2 for BCI-CPI).
These differences mean that for some countries, year-to-year changes in WGI scores that are statistically significant at conventional levels turn out not to be significant when using the BCI (because those changes are driven by simultaneous changes in several sources that the BCI deems mostly random error), while for other countries, the BCI identifies meaningful over-time changes that do not show up as statistically significant changes in the WGI (because the BCI, by using more information, estimates these perceived corruption levels more precisely). According to the table in Standaert’s article, between 2000 and 2010:
- For three countries (Greece, Italy, Venezuela), both the BCI and WGI detect statistically significant increases in perceived corruption, while for another three (Georgia, Macedonia, Qatar), both the BCI and WGI detect statistically significant decreases in perceived corruption.
- For two countries (Eritrea and the UK) there was a statistically significant worsening of perceived corruption on the WGI that is not significant at conventional levels when using the BCI, while for another three countries (Rwanda, Serbia, and the UAE), there was a statistically significant improvement in WGI scores that turns out not to be statistically significant when using the BCI. (For one additional country—Liberia—both indexes find an improvement, though the WGI reports a higher level of statistical confidence than does the BCI.)
- There are eight countries (Colombia, Czech Republic, Hungary, Kuwait, Iceland, Slovakia, South Africa, and the US) that had statistically significant increases in perceived corruption on the BCI that were not picked up as significant on the WGI, and three countries (Gambia, Iraq, Saudi Arabia) that had statistically significant decreases in perceived corruption on the BCI that were not picked up as significant on the WGI.
These differences are not trivial. If we assume for the moment that the BCI is indeed more accurate than the WGI, then using year-to-year changes on the latter index to measure changes in perceived corruption between 2000 and 2010 would have generated five false positives and 11 false negatives (or 12, depending on how one counts Liberia)—and the errors go in both directions (wrongly finding both increases and decreases, and wrongly missing both increases and decreases).
Now, while I do think the BCI represents an improvement over the WGI (and CPI), certainly for use in assessing over-time trends, I want to be clear that this index also has important limitations, and should not be treated as necessarily giving us the “correct” answer. There are a bunch of issues one could raise, but let me just highlight a couple:
- First, an important difficulty not only with the BCI but with the WGI and the CPI is the concern that the measurement errors associated with the different sources are in fact correlated with one another. The concern about so-called “correlated measurement error” may seem like a technical statistical point, but it’s a really big deal, and far too often overlooked (even though the creators of these indexes frankly acknowledge and discuss it). Here’s the issue in a nutshell: As discussed above, all of these aggregation techniques—whether it’s the simple averaging of the CPI, the unobserved component model of the WGI, or the Bayesian estimation techniques employed in the BCI—assume that each source’s perceived corruption score is composed of the “true” level of perceived corruption plus some measurement error that is both random and independent across sources. But what if the random error isn’t independent across sources? This could occur if the people doing the evaluations for each source are reading and incorporating past scores from the other sources, or from the aggregate indicators, or if they’re all (over-)reacting to the same events or reports in the media. If these errors are correlated, then the estimated standard errors will be too small—maybe much, much too small. This is a very difficult problem: After all, if we see several sources moving in roughly the same direction, it’s hard to tell if this is because of a change in the “true” level of perceived corruption, or whether it’s because of some form of correlated error. The BCI, for all its virtues, doesn’t solve this problem.
My hunch—based on a bit of anecdotal evidence, but not much more—is that there’s probably quite a bit of correlated error out there (everyone reads the CPI and WGI these days, for example), which means the BCI, like the WGI and CPI, probably underestimates the margins of error, and may therefore exaggerate the number of statistically significant changes within countries and differences across countries.
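One more toy simulation may help make the correlated-error worry concrete. Here I assume (with made-up numbers) that each source’s error has two parts: an idiosyncratic part, and a shock shared by all four sources (everyone reacting to the same news coverage, say). Averaging only washes out the idiosyncratic part, so the true uncertainty of the average is considerably larger than what the independence assumption would imply.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical error structure: each source's error = shared shock +
# independent noise. The true score is fixed at zero for simplicity.
noise_sd, common_sd, n_sources, n_sim = 0.3, 0.3, 4, 10_000

common = rng.normal(0.0, common_sd, size=(n_sim, 1))       # shared error
idio = rng.normal(0.0, noise_sd, size=(n_sim, n_sources))  # independent error
avg = (common + idio).mean(axis=1)

# Actual spread of the averaged score around the truth:
actual_se = avg.std()
# Spread that the independence assumption would predict
# (per-source error sd divided by sqrt of the number of sources):
naive_se = np.sqrt(common_sd**2 + noise_sd**2) / np.sqrt(n_sources)
print(actual_se, naive_se)
```

In this sketch the naive standard error understates the actual one by a wide margin, which is exactly why correlated error would make year-to-year changes look statistically significant more often than they should.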
- Second, as I’ve argued elsewhere in the context of the CPI, the best aggregation techniques in the world can’t solve the problem of inter-temporal non-comparability if the underlying sources themselves are not on a constant scale over time. It’s hard to believe that these scales really are constant across long stretches of time, so that, say, a score of 4 out of 7 on the International Country Risk Guide’s corruption index (one of the sources used in the CPI, WGI, and BCI) means the same thing in 2016 as it did in 1986. In fact, Standaert’s paper supplies what to me seems like indirect evidence of this problem, even though his interpretation is quite different. Standaert uses the BCI to see whether there is any worldwide trend in the level of perceived corruption between 1995 and 2012, and finds that there is indeed a statistically significant increase in perceived corruption over this period, and this increase is particularly pronounced between 2005 and 2012. Should we interpret this as evidence that corruption (or even perceived corruption) got markedly worse—on a constant scale—between 2005 and 2012? Possibly – but I’m skeptical. To me, this is at least suggestive evidence that what’s actually going on is a shift in the scale that the underlying raters are using: a kind of “grade deflation” in which it’s gotten harder to get top marks for corruption control. Maybe I’m wrong about that, but the point is that we really have no way of knowing (at least not without much more information), which should make us skeptical not only of the worldwide trend but also of the trend findings for any individual country.
Those caveats—along with other common observations about corruption perception indexes, such as the obvious but important point that perceptions can differ from reality—are important. Yet aggregate perception indicators will likely continue to play an important role in anticorruption research and practice, and so long as that’s true, we should be constantly working on improving those indicators, even as we remain mindful of their inherent limits. In that spirit, I think the BCI is an important step forward. I hope it gets more recognition, and that Standaert and others continue to maintain and update it.