Detecting COVID reporting fraud with Benford’s Law

Using a technique commonly used to detect tax fraud, this analysis attempts to gauge the accuracy of each country's daily COVID reports.

David Head

Show Notes

Since the coronavirus pandemic started, I’ve wondered how accurately countries have been reporting data on their cases. I read as much as I could about it, but still never felt satisfied.

Then I was watching a Netflix show called Connected, and they started talking about a numerical phenomenon called Benford’s Law. Apparently governments use this law to determine the probability of tax fraud, rigged elections, and more.

The premise is that in many real-world datasets, especially those spanning several orders of magnitude, the leading digits of the numbers follow a consistent distribution based on a base-10 logarithm. About 30% of the time the first digit will be 1, about 18% of the time it will be 2, and so on, as you can see from the chart below.

Image for post

Many real-world datasets, from stock prices to population sizes, follow this distribution closely.
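
For reference, the expected share of each leading digit d under Benford’s Law is log10(1 + 1/d). Here is a minimal Python sketch that reproduces those percentages. The function name is only illustrative and is not taken from the spreadsheet.

```python
import math


def benford_probabilities():
    """Expected frequency of each leading digit 1-9 under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


expected = benford_probabilities()
for digit, p in expected.items():
    # Print each digit with its expected share as a percentage.
    print(f"{digit}: {p:.1%}")
```

The nine probabilities sum to 1, and each digit is less likely than the one before it.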

So after I finished Connected, I decided to test Benford’s Law against COVID data. I grabbed the raw daily COVID cases and deaths dataset from the WHO to run the analysis against.

Method:

1. Extract the first digit from each country’s daily ‘New Cases’ and ‘New Deaths’ reports.

Image for post

2. Calculate the probability of occurrence of each first digit, 1 through 9, on the ‘New Cases’ and ‘New Deaths’ datasets, by country.

Image for post

3. Run a correlation of each country’s reporting distributions against the Benford’s Law distribution.

4. Rank each country by the average of its ‘New Cases’ and ‘New Deaths’ correlations.
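
The four steps above were done in a spreadsheet, but they can be sketched in Python roughly as follows. This is a sketch under assumptions, not the spreadsheet’s exact formulas: it assumes a Pearson correlation, skips days with zero (or negative) counts since they have no leading digit from 1 to 9, and uses made-up numbers rather than real WHO data. All function and variable names are hypothetical.

```python
import math
from collections import Counter

# Expected Benford frequencies for leading digits 1..9.
BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]


def first_digit(n):
    """Step 1: leading digit of a positive integer, or None for zero/negative values."""
    if n <= 0:
        return None
    return int(str(n)[0])


def digit_distribution(values):
    """Step 2: share of each leading digit 1..9 among the positive values."""
    digits = [d for d in map(first_digit, values) if d is not None]
    counts = Counter(digits)
    total = len(digits)
    return [counts[d] / total if total else 0.0 for d in range(1, 10)]


def pearson(xs, ys):
    """Pearson correlation coefficient (assumed; the source does not name the method)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return float("nan")
    return cov / (sd_x * sd_y)


def benford_score(new_cases, new_deaths):
    """Steps 3 and 4: average of the two correlations against Benford's Law."""
    cases_corr = pearson(digit_distribution(new_cases), BENFORD)
    deaths_corr = pearson(digit_distribution(new_deaths), BENFORD)
    return (cases_corr + deaths_corr) / 2


# Made-up illustrative series, NOT real WHO data.
reports = {
    "Country A": {
        "new_cases": [2 ** k for k in range(30)],
        "new_deaths": [3 ** k for k in range(25)],
    },
    "Country B": {
        "new_cases": [900 - 3 * k for k in range(100)],
        "new_deaths": [90 - k for k in range(40)],
    },
}

# Rank countries from highest to lowest average correlation.
ranking = sorted(reports, key=lambda c: benford_score(**reports[c]), reverse=True)
for country in ranking:
    print(country, round(benford_score(**reports[country]), 3))
```

In this toy data, Country A’s series grow geometrically and track Benford’s Law well, while Country B’s counts are concentrated on high leading digits, so it ranks lower.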

Result

The results were both reaffirming and surprising.

Image for post

For over 70% of countries (154, specifically), the daily COVID reports had a correlation above 90% with the Benford’s Law distribution. A handful were above 99%!

Image for post
Average correlations across all countries

This leads me to believe that the true number of new cases and deaths across the globe follows Benford’s Law. It also leads me to believe that most countries are reporting pretty accurately.

In the cases where the correlations were weaker, there was an interesting pattern: either a country’s ‘New Cases’ or its ‘New Deaths’ report was highly correlated, while the other was not.

Image for post

If you look at the sparkline distribution charts, you can see that for a handful of countries the number 1 appears less frequently than other numbers, when it should be the most frequent.[1] Additionally, the sparkline distributions for the low correlations are bumpy, and in some cases higher first digits appear more frequently than lower ones.
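
The same visual check can be approximated in code. The sketch below uses a hypothetical helper (not part of the spreadsheet) that takes the nine first-digit frequencies for one country and flags whether 1 is the most frequent digit, then counts the "bumps" where a higher digit is more frequent than the digit before it. The two example distributions are invented for illustration.

```python
def shape_flags(distribution):
    """distribution: list of 9 frequencies for leading digits 1..9."""
    one_is_most_frequent = distribution[0] == max(distribution)
    # Count places where a digit is more frequent than the digit just below it.
    bumps = sum(1 for a, b in zip(distribution, distribution[1:]) if b > a)
    return {"one_is_most_frequent": one_is_most_frequent, "bumps": bumps}


# Invented example distributions, not taken from the WHO data.
smooth = [0.30, 0.18, 0.12, 0.10, 0.08, 0.07, 0.06, 0.05, 0.04]
bumpy = [0.08, 0.15, 0.10, 0.14, 0.09, 0.12, 0.11, 0.10, 0.11]

print(shape_flags(smooth))
print(shape_flags(bumpy))
```

A Benford-like distribution should have 1 as the most frequent digit and no bumps. The more bumps there are, the less the sparkline resembles the expected curve.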

I won’t speculate as to why one of the correlations is so weak compared to the other, but I am extremely curious to learn more.

I’ll let you take it from here to explore the data yourself.

Closing

Hopefully this Benford’s Law analysis was as interesting to you as it was to me. Since I only spent an hour or so putting together the report after I finished the Netflix show, some of the statistical methods and analysis could certainly be better.

Feel free to explore the analysis yourself, make a copy, and audit the formulas and methods. If you see a way to improve anything, message me on Twitter and I’ll update it here.

Spreadsheet: COVID Benford’s Law Correlation Analysis


[1] The left-hand side of the line is the frequency of number 1, whereas the right-hand side is the frequency of 9, with other numbers spread across the middle.