Benford’s Law: An Introduction for Auditors

Benford's Law has long been the benchmark for detecting accounting fraud. However, as accounting fraud grows more sophisticated, we need a more powerful tool: AI. But to outperform Benford's Law with AI, one needs expert knowledge of Benford's Law.

History

Benford's Law is named after the American physicist Frank Benford (1883–1948). In the days before computers, data analysts had to pore over large books of logarithm tables on good, old-fashioned paper.

As he did so, Benford noticed something interesting. In every data set, whether surface areas of rivers or death rates, the same pattern repeated: numbers with low leading digits occurred more frequently than numbers with high leading digits.

If you Google Benford's Law, you will likely discover the "first-digit test," which looks at the leading digits 1 to 9. For practical purposes, this test is too simple: with only nine bins, it is not granular enough to be sensitive to manipulation.

Our research instead uses the more advanced "first-two-digit test," developed by Prof. Dr. Mark Nigrini, which looks at the ninety two-digit combinations from 10 to 99.

Why? It works.

The ACFE (Association of Certified Fraud Examiners) is the world's leading fraud examiner association. A recent ACFE paper, "Using Benford's Law to Detect Fraud," recommended the first-two-digit test for efficient audit samples. Our research confirms their findings.

The more advanced test lets us drill down to a more granular level to uncover potential accounting fraud. With that, we get more confidence that our findings are correct.

The first-two-digit test formula:

Figure 1: Formula for Benford's first-two digits distribution
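In text form, the first-two-digit probability shown in Figure 1 can be written as:

```latex
P(D_1 D_2 = d) \;=\; \log_{10}\!\left(1 + \frac{1}{d}\right),
\qquad d = 10, 11, \ldots, 99
```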

The formula expresses Benford's observation that lower digits occur with higher frequencies than higher digits.

An accounting example:

For example, with legitimate accounting transactions, we expect amounts starting with the first-two digits 30 to occur 1.42% of the time (Log10(1 + (1/30))). In other words, in 100,000 accounting transactions, we expect approximately 1,420 (i.e., 1.42%) to start with 30.

The formula expresses mathematically why we expect lower digits to occur with higher frequencies than higher digits:

Digit 10: Log10(1+(1/10)) = 4.14%

Digit 11: Log10(1+(1/11)) = 3.78%

Digit 12: Log10(1+(1/12)) = 3.48%

Digit 13: Log10(1+(1/13)) = 3.22%

Digit 14: Log10(1+(1/14)) = 3.00%

Digit 15: Log10(1+(1/15)) = 2.80%

…

Digit 99: Log10(1+(1/99)) = 0.44%
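The table above is easy to reproduce yourself. A minimal Python sketch:

```python
import math

# Expected first-two-digit probabilities: P(d) = log10(1 + 1/d), d = 10..99
benford = {d: math.log10(1 + 1 / d) for d in range(10, 100)}

print(f"Digit 10: {benford[10]:.2%}")   # 4.14%
print(f"Digit 30: {benford[30]:.2%}")   # 1.42%
print(f"Digit 99: {benford[99]:.2%}")   # 0.44%

# The ninety probabilities telescope to exactly 1:
# sum of log10((d+1)/d) for d = 10..99 equals log10(100/10) = 1
print(f"Total: {sum(benford.values()):.4f}")  # 1.0000
```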

The log transformation gives us the expected probability for each two-digit combination from 10 to 99. These ninety probabilities sum to exactly 100%, so together they form a complete probability distribution.

When we visualize Benford's Law's expected distribution for the digits 10 to 99 (gray line), we can immediately see the higher expectation for lower digits and the lower expectation for higher digits.

Figure 2: Calculation for the vertical bins 10 - 99

The vertical bars represent the observed distribution of the dataset. As you can see, they sometimes deviate slightly from the expected distribution (gray line). That's normal. Based on research done by Prof. Dr. Mark Nigrini, not even natural, legitimate data follows Benford's expected distribution perfectly.

Even legitimate accounting data, as seen below, deviates quite a bit from Benford's expected distribution (gray line). That's normal.

Figure 3: Legitimate accounting data (looks universally the same)

One of the challenges with Benford's Law is determining how much deviation still conforms and when the data becomes non-conforming. Without going into the technical details, we recommend a scientific approach to assessing non-conformity, such as MAD (Mean Absolute Deviation) or simulation. What doesn't work is an arbitrary deviation threshold, or a statistical test such as the Chi-square test, which flags even trivial deviations as significant at the large sample sizes Benford's Law requires.

In AuditedAI, the visualization is the second most important part, after the risk prediction. Why? With a bit of experience in visualization, you can often see what's wrong with the data just by looking at it. Let's look at the following two visualization variants.

Both visualizations are based on the same (!) data:

Again, both visualizations use the same data. However, the first is a traditional histogram, while the second is based on Benford's Law. With the traditional histogram, we cannot see the outliers among the accounting amounts starting with the digits 45.

Technically, the traditional histogram uses the amounts as bins, while the modified histogram uses bins based on Benford’s Law. The difference is striking! As an auditor, with the modified histogram, you can see immediately which accounting transactions look suspicious (i.e., all starting with the digit 45).
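The binning step behind the modified histogram is extracting the first two significant digits of each amount. A small, illustrative helper (our own sketch, not from any particular library):

```python
def first_two_digits(amount: float) -> int:
    """Return the first two significant digits (10-99) of a nonzero amount.

    Values below 1 (e.g. 0.0452 -> 45) are handled by scaling
    into the range [10, 100) before truncating.
    """
    x = abs(amount)
    if x == 0:
        raise ValueError("zero has no leading digits")
    while x < 10:
        x *= 10
    while x >= 100:
        x /= 10
    return int(x)

print(first_two_digits(4523.70))  # 45
print(first_two_digits(0.0452))   # 45
```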

The Three Biggest Mistakes in Using Benford's Law

We’ve spent an almost unhealthy amount of time researching and applying (in code) Benford's Law. Coding requires you to be meticulous; if you're not, the program runs incorrectly or doesn't run at all. When you're coding, spotting your mistakes and those of others is easy.

This post isn't about bragging. However, anyone who spends several years researching a topic should become a world expert in their domain.

We're standing on the shoulders of giants. We couldn't have developed AuditedAI without the extensive work done by others before us. This is our contribution to pushing the research further. Who knows what the next generations will be able to solve?

Mistake #1: Aggregating Data

There's a staggering difference between researching a method and applying it in the real world. We've never seen this mistake in research papers, but when you write code to run the analysis, it can easily sneak in. We've seen PhDs in statistics commit it.

A simplified example: we have three companies, A, B, and C. As an auditor, analyzing all the accounting data for companies A, B, and C together leads to a high probability of a wrong conclusion. The correct way is to explore each company's accounting data separately. Namely, we investigate all the accounting transactions for company A, followed by company B, and finally company C. Or in any order you want, but do not combine all accounting transactions for all companies. That’s a grave mistake.

In data analytics lingo, analyzing the data for each company is called a “group by.” In the case of Benford’s Law, if we don’t group by company, we’re committing a fatal error.
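A minimal sketch of the idea in plain Python (the pandas equivalent would be `df.groupby("company")`); the ledger figures here are made up for illustration:

```python
from collections import Counter, defaultdict

# Hypothetical ledger: (company, amount) pairs; the figures are made up.
ledger = [
    ("A", 4520.10), ("A", 1034.00),
    ("B", 2210.55), ("B", 1187.20),
    ("C", 3099.99), ("C", 1500.00),
]

def leading_two(amount):
    """First two significant digits (10-99) of a positive amount."""
    x = abs(amount)
    while x < 10:
        x *= 10
    while x >= 100:
        x /= 10
    return int(x)

# Wrong: pooling all companies into one observed distribution.
pooled = Counter(leading_two(amount) for _, amount in ledger)

# Right: group by company first, then test each distribution separately.
by_company = defaultdict(Counter)
for company, amount in ledger:
    by_company[company][leading_two(amount)] += 1

for company, counts in sorted(by_company.items()):
    print(company, dict(counts))
```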

Mistake #2: Defining Non-Conformity

All data deviates from Benford's expected distribution, even legitimate data. The challenge is deciding how much deviation is normal and when the data becomes non-conforming. What's the wrong approach? Essentially, any approach that is not based on science. For example, adding an arbitrary range of +/- 10% to determine non-conformity is wrong. So is relying on a statistical test such as the Chi-square test: at the large sample sizes Benford's Law requires, it flags even trivial deviations as significant. Benford's expected distribution is also far from normal; it is highly skewed to the right.

So, what’s the correct approach? Essentially, any method that is based on science. Science means we can state a hypothesis, test it, and make the result repeatable. For example, Prof. Dr. Mark Nigrini analyzed many natural (legitimate) datasets and derived statistical thresholds based on MAD (Mean Absolute Deviation). An alternative approach is simulation: through large-scale simulations, we can test a hypothesis and make it repeatable.
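The MAD statistic itself is straightforward to compute. A minimal sketch (the commonly cited first-two-digit conformity cutoffs from Nigrini's work are around 0.0012, 0.0018, and 0.0022; treat the exact values as something to verify against the original source):

```python
import math

# Expected Benford first-two-digit proportions for digits 10..99.
BENFORD = [math.log10(1 + 1 / d) for d in range(10, 100)]

def mad(counts):
    """Mean Absolute Deviation between observed first-two-digit
    proportions and the Benford expectation, averaged over 90 bins."""
    n = sum(counts)
    return sum(abs(c / n - p) for c, p in zip(counts, BENFORD)) / len(BENFORD)

# Sanity checks: data in exact Benford proportions scores ~0, while
# uniformly distributed digits score far above any plausible cutoff.
exact = [round(p * 1_000_000) for p in BENFORD]
print(f"Benford data:  MAD = {mad(exact):.6f}")
print(f"Uniform data:  MAD = {mad([1] * 90):.6f}")
```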

Mistake #3: Ignoring the Sample Size

Ignoring the sample size is a common mistake, even among experts. For example, I've seen experts (PhDs in statistics) apply Benford's Law to potential election fraud with only 200 samples. That's not enough samples! We need a relatively large number of samples before we can use Benford's Law. While the Central Limit Theorem (CLT) still holds for Benford's Law, the required sample size is substantially larger than for normally distributed data, where the CLT often manifests at only around 30 samples. The sample size was the focus of my doctoral thesis, and based on my research, ignoring it often creates too many false positives (i.e., false alarms).

Benford’s Law’s skewed distribution gives us two challenges:

1. Defining the required sample size

2. Defining a threshold (fraud / not fraud)

Defining the required sample size for Benford's Law was the topic of my doctoral thesis. In general, we want to be confident, to a very high degree, that the accounting data sample size is sufficient.

Technically, we define a very high degree as a 99.9% confidence level. This confidence level should not be confused with the risk level we get from the AI; those are different measures. One captures the sample-size confidence, the other the risk of accounting fraud. However, they are related: the higher the confidence level for the sample size, the higher the quality of the AI prediction. Conversely, if the sample-size confidence is low, the quality of the AI prediction is also lower.

Predicting accounting fraud is challenging. Too many false positives (false alarms) can be a problem. Even a single false negative (missing accounting fraud) is a problem. Thus, we must provide AI with the best possible data to make a qualitative prediction.

In my research, I've seen studies trying to predict fraud in US elections where the sample size was too small. It's like going to a nightclub in London and then generalizing that all people living in London must be in their 20s.

How should we approach the question of sample size in auditing for accounting fraud? Unfortunately, the answer is not simple. Let's start with a "simple" use case where the data is normally distributed. Based on the Central Limit Theorem (CLT), we expect a sample size of approximately 30 for a normally distributed dataset.

"The Central Limit Theorem (CLT) states that a sufficiently large random sample from the population should approximately be normally distributed. However, as we see in this study, the sample size varies massively depending on the distribution of the population."

Figure 4: Visualizing Accounting Fraud [Research AuditedAI]

Figure 5: Benford's Law still obeys CLT, but at a high number of samples (visualization using 4,000 samples)

For a statistics professor, approximately 30 samples might suffice. However, we're dealing with real-world problems (accounting fraud), and "approximately" won't do. The stakes are too high.

For a given sample size, we want to know the confidence level it supports. The answer is not straightforward. Remember, we're still dealing with a "simple" normal distribution at this stage. Once we get to accounting fraud, it gets even more challenging!

Monte Carlo simulations come to the rescue. If we (1) sample from a normal distribution with a mean of 100 and a standard deviation of 15, as in the IQ distribution, (2) want the sample mean to land in the range 99 to 101, and (3) require a confidence level of 95%, we need a whopping 840 samples.
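A sketch of this kind of simulation in Python. The parameters (mean 100, SD 15, window 99 to 101, 95% target) follow the example above; the exact percentages will vary slightly from run to run:

```python
import numpy as np

rng = np.random.default_rng(42)

def coverage(n, sims=10_000, lo=99.0, hi=101.0, mu=100.0, sigma=15.0):
    """Fraction of simulated samples of size n whose mean lands in [lo, hi]."""
    means = rng.normal(mu, sigma, size=(sims, n)).mean(axis=1)
    return float(np.mean((means >= lo) & (means <= hi)))

print(f"n=200: {coverage(200):.1%}")  # well below the 95% target
print(f"n=840: {coverage(840):.1%}")  # close to 95%
```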

“We're using Monte Carlo simulations to determine the sample size. Monte Carlo simulations heavily rely on the Law of Large Numbers (LLN): the distribution of large samples (i.e., 50,000 simulations) should give us reliable confidence levels.”

In the following Python script, we run a Monte Carlo simulation to determine the number of accounting transactions required:

Figure 6: Monte Carlo Simulation in Python for Sample Size Requirement in Benford's Law

(1) We use 50,000 simulations. In the example above, (2) a sample size of 2,000 gives us (3) a confidence level of only 99.69%.

In other words, we take a random sample (with replacement) of 2,000 transactions and check whether the sample is Benford-conforming. We then repeat these steps for a total of 50,000 simulations. Finally, we count how many samples were Benford non-conforming (154) and divide by the total number of simulations: (50,000 - 154) / 50,000 = 99.69% confidence.
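The procedure described above can be sketched as follows. The conformity rule is an assumption on our part (an MAD cutoff of 0.0022, a commonly cited first-two-digit threshold from Nigrini's work), not necessarily the exact test behind the article's figures, and we use fewer simulations here for speed:

```python
import numpy as np

DIGITS = np.arange(10, 100)
BENFORD_P = np.log10(1 + 1 / DIGITS)

# Assumed conformity rule: MAD at or below 0.0022 (hypothetical here,
# borrowed from commonly cited Nigrini cutoffs).
MAD_CUTOFF = 0.0022

rng = np.random.default_rng(0)

def simulate(sample_size, sims=2_000):
    """Share of genuinely Benford-distributed samples that the
    MAD test declares conforming."""
    conform = 0
    for _ in range(sims):
        draw = rng.choice(90, size=sample_size, p=BENFORD_P)
        observed = np.bincount(draw, minlength=90) / sample_size
        conform += np.abs(observed - BENFORD_P).mean() <= MAD_CUTOFF
    return conform / sims

print(f"n = 2,000: {simulate(2000):.2%} conform")
# Small samples fail the test even for perfectly clean Benford data,
# which is exactly the false-positive problem of Mistake #3:
print(f"n =   500: {simulate(500):.2%} conform")
```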

A confidence level of 99.69% is still a touch too low. We aim for the highest possible confidence level. Here’s what we got:

Figure 7: Required Sample Size for Benford's Law (first-two digit version)

Based on our simulation, we require a sample size of 3,000 to be highly confident that the sample size is sufficient. Why do we aim for the highest possible confidence level? It's a cascading effect for the AI. Detecting accounting fraud is generally challenging, and AI prediction suffers if the sample size is not as large as possible.

Why use AI instead of Benford’s Law for Accounting Fraud Detection?

In our view, Benford's Law is the benchmark in detecting accounting fraud. To outperform Benford's Law, one needs a very strong knowledge of Benford's Law. Without this expert knowledge, no successful AI can be built.

Benford's Law is a powerful approach to detecting fraud in data that should follow Benford's distribution. However, once you've researched and applied it for many years, as we have, you notice specific weaknesses. One is borderline cases. Another is "advanced accounting fraud": fraudsters mimicking Benford's Law to stay undetected. In other words, given the popularity of Benford's Law, we can expect fraudsters to use it to their advantage. There are publicly available programs that do precisely that.

In short, we trained an AI to detect accounting fraud that is impossible for Benford's Law to detect, while ensuring it never underperforms Benford's Law. Thus, by design, our AuditedAI consistently outperforms Benford's Law in the long run.