Statistically speaking, there's nothing "normal" in accounting fraud detection. To understand what "not normal" means, let's first see what "normal" means from a statistics point of view.

The normal distribution is credited to the German mathematician Carl Friedrich Gauss (1777 – 1855). Teachers love to teach the normal distribution as it's relatively easy to explain using the mean and standard deviation. Unfortunately, many real-world problems are not normally distributed. In such problems, a standard deviation is meaningless and even dangerous (LTCM, a hedge fund, lost billions by relying on models based on the normal distribution that underestimated outliers).

In my view, the best example of a normal distribution is human IQ (Intelligence Quotient). It's assumed that IQ is normally distributed with an average of 100 and a standard deviation of 15. Why is that so useful? Well, we know immediately that roughly 68% of the population falls within an IQ of 85 (100 minus one standard deviation of 15) to 115 (100 plus one standard deviation of 15).

Similarly, for two standard deviations, the range is an IQ of 70 – 130, corresponding to about 95% of the population. Thus, if you have an IQ of 130 (congratulations!), you're roughly in the top 2.5% in the world (95% of the population have an IQ between 70 and 130, so the remaining 5% splits into the top 2.5% and the bottom 2.5%).

[Figure: IQ distribution]
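The 68% and 95% figures can be checked directly. The snippet below is a minimal sketch that assumes the IQ model stated above (mean 100, standard deviation 15) and uses `statistics.NormalDist` from the Python standard library; the variable names are illustrative only.

```python
from statistics import NormalDist

# Assumed model from the text: IQ ~ Normal(mean=100, sd=15)
iq = NormalDist(mu=100, sigma=15)

# Share of the population within one and two standard deviations
within_one_sd = iq.cdf(115) - iq.cdf(85)
within_two_sd = iq.cdf(130) - iq.cdf(70)

# Share of the population with an IQ above 130
top_tail = 1 - iq.cdf(130)

print(f"85 to 115: {within_one_sd:.1%}")
print(f"70 to 130: {within_two_sd:.1%}")
print(f"above 130: {top_tail:.1%}")
```

The exact values land close to the rounded 68%, 95%, and 2.5% used in the text.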

Applying the parameters (e.g., mean and standard deviations) to a dataset that is indeed roughly normally distributed is not the problem. The problem starts when treating a dataset as normally distributed when it is not. In other words, if a dataset is far from normally distributed (i.e., highly skewed), everything breaks if we apply the parameters of a normal distribution.

Here's an analogy of how this can go wrong: let's say you have a dataset of the top marathon runners in the world. For illustration purposes, let's assume their resting heart rate averages 40. Since those athletes train and compete at high heart rates, let's assume that their heart rate standard deviation is 25 (which is high relative to the average). In other words, the athletes' heart rate distribution is far from normal. So what can go wrong in analyzing their heart rate? Let's assume we want to calculate the 95% heart rate range (i.e., two standard deviations). Adding two standard deviations to the resting heart rate gives 90 (40 + 25 + 25), which could make sense. However, if we calculate the lower end of the range, we get a heart rate of –10 (40 – 25 – 25). A negative heart rate is impossible. Thus, we clearly cannot apply the parameters of a normal distribution to the skewed dataset.
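The same failure can be reproduced on synthetic data. The sketch below is an illustration under stated assumptions, not real heart-rate data: it draws positive, right-skewed values from a lognormal distribution (median 40, shape 0.6, both chosen arbitrarily) and compares the "mean plus or minus two standard deviations" range with a range read directly from the empirical percentiles.

```python
import math
import random
import statistics

rng = random.Random(1)

# Synthetic right-skewed, strictly positive values (NOT real heart rates).
# Median 40 and shape 0.6 are arbitrary choices for illustration.
data = [rng.lognormvariate(math.log(40), 0.6) for _ in range(10_000)]

mean = statistics.fmean(data)
sd = statistics.stdev(data)

# Range obtained by (wrongly) applying normal-distribution parameters
normal_low, normal_high = mean - 2 * sd, mean + 2 * sd

# Range taken from the data itself: 2.5th and 97.5th percentiles
cuts = statistics.quantiles(data, n=40)  # 39 cut points in steps of 2.5%
pct_low, pct_high = cuts[0], cuts[-1]

print(f"mean +/- 2 sd: {normal_low:.1f} to {normal_high:.1f}")
print(f"percentiles:   {pct_low:.1f} to {pct_high:.1f}")
```

On this skewed sample, the normal-based lower bound drops below zero even though every value is positive, while the percentile-based range stays inside the data.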

Unfortunately, in data analytics, we often see analysts applying the parameters of a normal distribution to skewed data.

In academic terms, methods that assume a normal distribution are called "parametric": they describe the data through the distribution's parameters (i.e., mean and standard deviation). Methods for data that is not normally distributed are often referred to as "non-parametric," as in such situations we cannot rely on parameters such as the mean and standard deviation.

There are countless types of distributions. However, in my view, for most practical purposes, we can boil them down to three: the normal distribution and the two kinds of skewed distribution (negative and positive):

- Normal distribution (no skew)

- Skewed distributions (negatively or positively skewed)

Generally, dealing with skewed distributions ranges from hard to extremely hard. In using Benford's Law, we're dealing with a type of positively skewed data. I refer to it as a "type" of positively skewed data because, from a statistics point of view, Benford's Law does not match one of the standard named distributions.
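To make the positive skew concrete, the snippet below prints the first-digit probabilities that Benford's Law predicts, using the standard formula P(d) = log10(1 + 1/d) for the leading digits d = 1 to 9. The function name is illustrative only.

```python
import math

def benford_first_digit_probabilities():
    """Expected share of each leading digit 1..9 under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

probs = benford_first_digit_probabilities()
for digit, p in probs.items():
    print(f"{digit}: {p:.1%}")
```

The digit 1 leads about 30% of the time and the digit 9 less than 5% of the time, which is the skew referred to above.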

Benford’s Law’s skewed distribution gives us two challenges:

1. Defining the sample size required

2. Defining a threshold (fraud / not fraud)

Defining the required sample size based on Benford's Law was the topic of my doctoral thesis. In general, we want to be as confident as possible that the accounting data sample is large enough.

"Remember looking up tables in your statistics course?

Most statistics textbooks are outdated for today's data problems. Armed with a programming language (e.g., Python or R), we can simulate the most complex challenges in statistics."

Technically, we define a very high degree as a 99.9% confidence level. This confidence level should not be confused with the risk level we get from the AI. They are different measures: one captures the confidence in the sample size, and the other the risk of accounting fraud. However, they are related. The higher the confidence level for the sample size, the higher the quality of the AI prediction. Conversely, the AI prediction quality is lower if the sample size confidence is low.

Predicting accounting fraud is challenging. Too many false positives (false alarms) can be a problem. Even a single false negative (missing accounting fraud) is a problem. Thus, we must provide the AI with the best possible data to make a high-quality prediction.

In my research, I've seen studies trying to predict fraud in US elections where the sample size was too small. It's like going to a nightclub in London and then generalizing that all people living in London must be in their 20s.

How should we approach the question of sample size in auditing for accounting fraud? Unfortunately, the answer is not simple. Let's start with a "simple" use case where the data is normally distributed. Based on the Central Limit Theorem (CLT), the textbook rule of thumb is a sample size of approximately 30 for a normally distributed dataset.

"The Central Limit Theorem (CLT) states that the means of sufficiently large random samples from a population are approximately normally distributed. However, as we see in this study, the sample size required varies massively depending on the distribution of the population."

For a statistics professor, approximately 30 samples might suffice. However, we're dealing with real-world problems (accounting fraud), and "approximately" won't do. The stakes are too high.

For a sample size, we want to know the confidence interval. The answer is not straightforward. Remember, we're dealing with a "simple" normal distribution at this stage. Once we get to accounting fraud, it gets even more challenging!

Monte Carlo simulations come to the rescue again. If we take a sample from a normal distribution (1) with a mean of 100 and a standard deviation of 15, as expected in the IQ distribution, and we want the sample mean to fall in the range 99-101 (2) with a confidence level of 95% (3), we get a whopping 840 samples.
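A simulation of this kind can be sketched in a few lines. The code below is not the original script; it is a minimal sketch that assumes "range" refers to the sample mean, draws repeated samples of size n from a normal distribution with mean 100 and standard deviation 15, and reports how often the sample mean falls between 99 and 101. The function name and the reduced number of simulations (1,000, to keep the pure-Python run short) are assumptions.

```python
import random
import statistics

def share_of_means_in_range(n, sims=1_000, mean=100.0, sd=15.0,
                            low=99.0, high=101.0, seed=0):
    """Fraction of simulated samples of size n whose mean lies in [low, high]."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(sims):
        # One simulated sample of n IQ-like values, reduced to its mean
        sample_mean = statistics.fmean(rng.gauss(mean, sd) for _ in range(n))
        if low <= sample_mean <= high:
            hits += 1
    return hits / sims

for n in (30, 200, 840):
    print(n, share_of_means_in_range(n))
```

With 30 values the sample mean lands in the range only a minority of the time; in this sketch it takes several hundred values to approach 95%. For comparison, the closed-form normal approximation n = (1.96 × 15 / 1)² gives about 864, which is in the same neighborhood as the simulated figure.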

Why do we get a much higher number than the statistics textbooks suggest? Because we fix both the range (99 to 101) and the corresponding confidence level. Sample size determination gets more challenging for accounting fraud. However, we needed this building block to make the next jump.

“We're using Monte Carlo simulations to determine the sample size. Monte Carlo simulations heavily rely on the Law of Large Numbers (LLN): the distribution of increasingly large samples (i.e., 50,000 simulations) should converge to the underlying population distribution.”

In the following Python script, we run a Monte Carlo simulation to determine the number of accounting transactions required:

(1) We always use 50,000 simulations. In the example above, we used a sample size of 2,000, giving us (3) a confidence level of only 99.69%.

In other words, we take a random sample (with replacement) of 2,000 transactions and check whether the sample conforms to Benford's Law. We repeat those steps for a total of 50,000 simulations. In the end, we count how many samples did not conform to Benford's Law (154) and divide the remaining samples by the total number of simulations. This gives us (50,000 – 154) / 50,000 = 99.69% confidence.
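The resampling loop described above can be sketched as follows. This is not the original script, and it rests on several assumptions that should be read as placeholders: the conformity check is the mean absolute deviation (MAD) between observed and expected first-digit shares with a cutoff of 0.015, the "ledger" is synthetic (log-uniform amounts over five decades, which follow Benford's Law closely), and only 300 simulations are run instead of 50,000 to keep the example fast. The function names are hypothetical, and the output is not expected to reproduce the 99.69% figure.

```python
import math
import random
from collections import Counter

BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit(x):
    """Leading non-zero digit of a positive number."""
    return int(f"{x:.15e}"[0])

def mad_from_benford(digits):
    """Mean absolute deviation between observed and Benford first-digit shares."""
    counts = Counter(digits)
    n = len(digits)
    return sum(abs(counts[d] / n - p) for d, p in BENFORD.items()) / 9

def conformity_rate(population_digits, sample_size, sims=300,
                    mad_cutoff=0.015, seed=0):
    """Share of resampled data sets that pass the (assumed) conformity check."""
    rng = random.Random(seed)
    conform = 0
    for _ in range(sims):
        # Random sample with replacement, as described in the text
        sample = rng.choices(population_digits, k=sample_size)
        if mad_from_benford(sample) <= mad_cutoff:
            conform += 1
    return conform / sims

# Synthetic stand-in for a ledger (assumption, not real accounting data)
ledger_rng = random.Random(42)
population = [first_digit(10 ** ledger_rng.uniform(0, 5)) for _ in range(100_000)]

for n in (100, 500, 2_000, 3_000):
    print(n, conformity_rate(population, n))
```

The pattern to look for is the direction of the effect: small samples fail the check often even though the underlying data conforms, and the conformity rate rises as the sample size grows. The exact rates depend entirely on the chosen conformity test and cutoff.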

A confidence level of 99.69% is still a touch too low. We aim for the highest possible confidence level. Here’s what we got:

Based on our simulation, we require a sample size of 3,000 to be highly confident that the sample size is sufficient. Why do we aim for the highest possible confidence level? It's a cascading effect for the AI. Detecting accounting fraud is generally challenging, and AI prediction suffers if the sample size is not as large as possible.

Key insight: "To analyze accounting transactions for potential fraud, the absolute minimum number of accounting samples is 3,000 (needs to be at least one month of accounting data)."

While more accounting transactions are better up to a certain point, the minimum number required is 3,000. Is there a maximum number of transactions? Technically, no: we can analyze billions of transactions. However, we can't visualize data at that scale (most visualization tools break). As visualization is a big part of an investigation, we recommend a maximum of 50,000 transactions for visualization purposes.