## Strategic and human factors are beyond formulas.

When you bring AI to businesses, math is the simple part. It is simple because it is objectively decidable using universally accepted standards. But a lot of important decisions do not fall into this category. When such decisions are about to be made, the debate is on: A dozen stakeholders bring as many opinions to the table, each backed by sensible arguments. This is the moment when leadership has to step up to prevent unproductive chaos. Sooner or later, these issues inevitably become C-level decisions.

For an AI project to become a success, mathematics, economics, strategy, and psychology all have to be taken into account. To illustrate how this can play out, in this post we will look at Machine Learning applications in the German car insurance sector.

Think of a Machine Learning system trained to validate car insurance claims. How car insurance works differs from country to country. In Germany, a claim is typically accompanied by a damage assessment from a certified expert, or a quotation or invoice from a repair shop. Based on these documents, the Machine Learning system classifies the claim into one of two categories: Either it recommends direct payout or, if the invoice seems to be in part or completely unjustified, technically incorrect, or problematic in any other way, it routes the case to a human subject-matter expert for further investigation.

As AXA and ControlExpert, two heavyweights in the German car insurance business, report in a joint statement, the ratio of perfectly legitimate claims is about 40% [1]. Therefore, an AI that enables human experts to focus on the remaining 60% of *relevant* cases promises a lot of saved time and money. To fully realize this benefit, the system has to be configured to serve the user’s needs best.

And that’s where the problems start. What is best for the user?

One might be tempted to think that, surely, in the highly mathematical and quantitative world of Machine Learning, there must be this single magic metric that can be used to find the optimal configuration. There is not.

But before we see why, let us first start simple and see what *can* be objectively said.

# How to quantify classifier performance

(This section covers basic concepts of binary classifiers. If you know what a confusion matrix and a ROC curve are, feel free to skip this section.)

Of course, there *are* objective ways to evaluate the performance of Machine Learning models. In fact, a system such as our claims classifier is a particularly simple case, a so-called binary classifier. Binary, because it answers either-or questions with two possible outcomes: yes or no, zero or one, thumbs-up or thumbs-down. Or, in our concrete scenario: Does the insurance claim look fishy or not?

A classifier is trained using historical case data. To evaluate its performance on new, *unknown* cases, you hold out a part of this data and do not use it for training. Instead, once the model is trained, you use the held-out data to test the model. From the model’s perspective this data is as unknown as completely new data would be, but you, the tester, know how each sample has been decided in the past by a human expert, so you know how it *should* be classified.

With this knowledge, you can group the classifier’s test set results into four categories.

1. If the classifier recommends the claim for further investigation and according to the record, the claim was indeed objectionable, then you have a **true positive (TP)**.

2. If the classifier recommends the claim for direct payout, and according to the record the claim was fine, you have a **true negative (TN)**.

3. If the classifier recommends investigation but the record shows the claim was fine, then this is a **false positive (FP)**.

4. If the classifier recommends direct payout but the record shows the claim was objectionable, then this is a **false negative (FN)**.
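The four categories above can be counted with a few lines of Python. This is only a minimal sketch: the function name `confusion_counts` and the six example claims are made up for illustration, with `True` standing for "objectionable" in the record and "investigate" in the prediction.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP, FN for a binary classifier.

    actual[i]    -- True if the record says claim i was objectionable
    predicted[i] -- True if the classifier recommends investigation
    """
    pairs = list(zip(actual, predicted))
    return {
        "TP": sum(1 for a, p in pairs if a and p),
        "TN": sum(1 for a, p in pairs if not a and not p),
        "FP": sum(1 for a, p in pairs if not a and p),
        "FN": sum(1 for a, p in pairs if a and not p),
    }


# Hypothetical held-out test set of six claims (made-up values).
actual = [True, True, False, False, True, False]
predicted = [True, False, False, True, True, False]

counts = confusion_counts(actual, predicted)
print(counts)
```

Every test sample lands in exactly one of the four categories, so the four counts always add up to the size of the test set.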

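The numbers in the four categories are commonly represented in the form of a two-by-two matrix, the so-called *confusion matrix*:

| | Classifier recommends investigation | Classifier recommends direct payout |
|---|---|---|
| **Claim was objectionable** | TP | FN |
| **Claim was fine** | FP | TN |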

Although the confusion matrix consists of four numbers, TP, FP, FN, and TN, only two of them are actually independent. This is because the data itself, or more specifically the test set, has a certain number P of samples that are actually positive and a certain number N of samples that are actually negative. Therefore, no matter how well the classifier works, in the end, we always have TP + FN = P and TN + FP = N. Two equations with four unknowns leave us with two degrees of freedom. A common way (but not the only one) to parameterize these degrees of freedom is by means of the **true positive rate (TPR)** and the **false positive rate (FPR)**.

Mathematically, they are defined as

$$\mathrm{TPR} = \frac{\mathrm{TP}}{P}, \qquad \mathrm{FPR} = \frac{\mathrm{FP}}{N}.$$

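Starting from these definitions, you can easily convince yourself with very little algebra that TPR and FPR are enough to determine all four of TP, FP, FN, and TN, by noting that

$$\mathrm{TP} = \mathrm{TPR} \cdot P, \qquad \mathrm{FN} = (1 - \mathrm{TPR}) \cdot P, \qquad \mathrm{FP} = \mathrm{FPR} \cdot N, \qquad \mathrm{TN} = (1 - \mathrm{FPR}) \cdot N.$$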

In reality, though, a case is hardly ever 100% clear. Instead, decisions are made with more or less confidence, depending on the evidence, but also on the personal inclination (read: bias) of the individual decision-maker.

Imagine that our classification task had to be done by a human expert rather than an AI. Further, imagine there were two candidates for the job. On the one hand, there is the ultra-skeptical subject-matter expert. For her, no claim is above suspicion. Case details notwithstanding, she simply flags every single claim for further investigation. Interestingly, without even considering the evidence for one second, she is certain to get a definite proportion of cases right: She correctly recommends investigation of every single unjustified claim. At the same time, though, she wrongly recommends every single correct claim for closer examination as well.

The second candidate is her exact opposite. His trust in the claimants knows no limit. Therefore, he recommends every single claim for direct payout. This expert also has *some* success to show, as he treats every justified claim correctly – but every unjustified claim wrongly.

Between the two extreme positions represented by these fictitious characters, one can imagine a full spectrum of more or less strict decision-makers. To represent this spectrum, we can introduce a "strictness parameter" *p*, more formally known as the *discrimination threshold*, whose value can be chosen between 0% and 100%.

If, given the case data, a claim is fishy with probability 70%, then an expert with strictness *p* would recommend further investigation if *p* < 70% and direct payout if *p* ≥ 70%.

This is exactly how our classifier works internally. It computes a probability and translates it into a binary decision according to a configuration parameter p, the discrimination threshold.
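The thresholding rule can be sketched in a few lines of Python. The function name `recommend` and the probability values are illustrative assumptions; a real system would take the probability from the trained model.

```python
def recommend(fishy_probability, threshold):
    """Translate an estimated probability into a binary recommendation.

    Investigation is recommended only if the estimated probability that
    the claim is objectionable exceeds the threshold p.
    """
    return "investigate" if fishy_probability > threshold else "pay out"


# A claim that looks fishy with probability 70%, seen by three experts.
lenient = recommend(0.70, threshold=0.90)   # p >= 70%: direct payout
boundary = recommend(0.70, threshold=0.70)  # p >= 70%: direct payout
strict = recommend(0.70, threshold=0.50)    # p < 70%: investigation

# The two extreme characters correspond to the ends of the range.
skeptic = recommend(0.01, threshold=0.0)    # flags even a 1% suspicion
truster = recommend(0.99, threshold=1.0)    # pays out even at 99%

print(lenient, boundary, strict, skeptic, truster)
```

Note that in this convention a *lower* threshold makes the decision-maker stricter: at a threshold of 0%, any nonzero suspicion triggers an investigation.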

Thus, TPR and FPR, and with them all four numbers in the confusion matrix, depend on the threshold value. How TPR and FPR vary with changing threshold can be understood by looking at a ROC curve [2]. You obtain this curve by plotting the points (FPR(p), TPR(p)) for all values of p between 0% and 100% in an FPR-TPR coordinate system.
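As a rough sketch of how such a curve is computed, the following Python snippet sweeps the threshold over a handful of values and records one (FPR, TPR) point per value, using the standard definitions TPR = TP / P and FPR = FP / N. The labels, scores, and threshold grid are made-up toy values, and `roc_points` is a hypothetical helper name.

```python
def roc_points(actual, scores, thresholds):
    """Return one (FPR, TPR) point per threshold.

    actual[i] -- True if claim i was objectionable according to the record
    scores[i] -- the classifier's estimated probability that claim i is fishy
    """
    p = sum(1 for a in actual if a)   # number of actual positives
    n = len(actual) - p               # number of actual negatives
    points = []
    for t in thresholds:
        tp = sum(1 for a, s in zip(actual, scores) if a and s > t)
        fp = sum(1 for a, s in zip(actual, scores) if not a and s > t)
        points.append((fp / n, tp / p))
    return points


# Toy test set: three objectionable and three legitimate claims.
actual = [True, True, True, False, False, False]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
thresholds = [0.0, 0.25, 0.5, 0.75, 1.0]

points = roc_points(actual, scores, thresholds)
for t, (fpr, tpr) in zip(thresholds, points):
    print(f"p = {t:.2f}: FPR = {fpr:.2f}, TPR = {tpr:.2f}")
```

The two ends of the sweep reproduce the two extreme experts: at a threshold of 0 every claim is flagged, which gives the point (1, 1), and at a threshold of 1 nothing is flagged, which gives (0, 0).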

**Figure 1.** The classifier represented by the blue curve is superior to the classifier represented by the green curve.

**Figure 2.** Two (or more) models can be combined into a single ROC curve that corresponds to the convex hull of their individual ROC curves.

The ROC curve can be used to compare two classifiers. In general, a classifier is better than another if its TPR is greater than the other’s for every FPR. In the ROC graph, the superior classifier’s ROC curve is everywhere *above* the other classifier’s ROC curve (see Figure 1).

So you can systematically compare various models until you have found the one with the best ROC curve. It might turn out that the best curve does not come from a single classifier but from the combination of two or several different classifiers. What matters is that in the end, you will always have a single best ROC curve. (In the case of a classifier combination, this curve is the upper convex hull of the ROC curves of each individual classifier [3], see Figure 2.)
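To make the convex-hull construction concrete, here is a minimal sketch that pools the ROC points of two models and keeps only the points on the upper convex hull. It uses a standard monotone-chain scan, which is one of several ways to compute a hull and not necessarily the procedure used in [3]; the two point lists are invented for illustration.

```python
def cross(o, a, b):
    """Z-component of the cross product of the vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def upper_hull(points):
    """Upper convex hull of ROC points, always including (0, 0) and (1, 1)."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for pt in pts:
        # Drop the last hull point while it lies on or below the straight
        # line from its predecessor to the new point.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], pt) >= 0:
            hull.pop()
        hull.append(pt)
    return hull


# Hypothetical (FPR, TPR) operating points of two different models.
model_a = [(0.1, 0.6), (0.4, 0.8)]
model_b = [(0.2, 0.5), (0.3, 0.85), (0.6, 0.9)]

hull = upper_hull(model_a + model_b)
print(hull)
```

In this toy example the hull contains one point from each model, so neither model alone achieves the combined curve.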

At this point, we are still in a simple world. You can objectively select the model whose ROC curve beats the competition and you are sure you brought the best player to the field.

Unfortunately, having selected the best classifier, you are far from done. You still have to choose the best *point* on the curve. The question then is: Which threshold serves the user best?

# The accuracy optimum

(In the remainder of this post, we will show a ROC curve and other characteristics of a trained binary classification model. The model is to be understood as an illustration only – it is *not* trained on actual insurance data but on the "Adult" dataset [4].)

At first glance, finding the best threshold value seems to be a trivial task. Just go for maximum accuracy, the point where as many predictions as possible are correct. Every other choice leads to more errors. Surely, we want to avoid that, right?

Accuracy is formally defined as the ratio of correct predictions among all predictions, that is

$$\mathrm{ACC} = \frac{\mathrm{TP} + \mathrm{TN}}{P + N}.$$

When we calculate ACC for each point on the ROC curve of our example, we find a maximum accuracy of 85% (see Figures 3 and 4). Is that the best we can do?
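The search for the accuracy optimum can be sketched as follows. Accuracy, the share of correct predictions (TP + TN) / (P + N), is rewritten in terms of TPR and FPR so that it can be evaluated directly on ROC points. The ROC points and the class sizes P = 200 and N = 800 are made-up values, not the figures of the example model.

```python
def accuracy(tpr, fpr, p, n):
    """Share of correct predictions: (TP + TN) / (P + N)."""
    true_positives = tpr * p
    true_negatives = (1 - fpr) * n
    return (true_positives + true_negatives) / (p + n)


def best_accuracy_point(roc, p, n):
    """Return the (FPR, TPR) point with the highest accuracy."""
    return max(roc, key=lambda pt: accuracy(pt[1], pt[0], p, n))


# Hypothetical ROC points and class sizes.
roc = [(0.0, 0.0), (0.1, 0.6), (0.3, 0.85), (1.0, 1.0)]
P, N = 200, 800

best = best_accuracy_point(roc, P, N)
best_acc = accuracy(best[1], best[0], P, N)
print(best, round(best_acc, 2))
```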

**Figure 3.** In our example, the best possible accuracy is 85%.

It *could* be, but only in one particular case, that is, when both kinds of errors, false positives and false negatives, are equally painful from the user’s business perspective. In most cases, though, the two errors have a significantly different impact, and, as we will see, our car insurance claims example is no exception.

**Figure 4.** Accuracy, cost, and full confusion matrix along our example’s ROC curve. CST has been rescaled for enhanced visualization, so that the maximum possible cost is 1.

# The cost optimum

The damage resulting from wrong decisions is not always easy to quantify. Weighing false positives and false negatives against each other is a classical dilemma: Is condemning an innocent person to prison (false positive) worse than letting a murderer walk free (false negative)? By how much?

Fortunately, in the business world, things are not quite as dramatic, and with some effort the damage due to wrong decisions can be quantified in terms of money. This leads us to cost estimation. To assign price tags to the two possible errors, false positive and false negative, we have to examine what exactly happens in case of each error.

In case of a false positive, a completely correct bill is unnecessarily sent to an expert, who then spends some of his valuable time trying to find an error where there is none. To simplify the calculation, let’s say this process takes him 12 minutes and the all-in costs of his labor are 100 € per hour. Then, each false positive wastes 20 €.

In case of a false negative, an invoice with unjustified items is paid out. That means the insurer unnecessarily loses the corresponding monetary value of these items. The average value of unjustified claims can be calculated from the historical case data, which has been checked and corrected by human subject-matter experts. Let’s assume it is 100 € per case.

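Of course, this is an oversimplified representation of cost estimation, but it suffices to get the idea: In the end, you will have a definite monetary value of each error type. In our case it looks like this:

$$\mathrm{CST} = 20\,€ \cdot \mathrm{FP} + 100\,€ \cdot \mathrm{FN}$$

or, equivalently, in terms of TPR and FPR:

$$\mathrm{CST} = 20\,€ \cdot \mathrm{FPR} \cdot N + 100\,€ \cdot (1 - \mathrm{TPR}) \cdot P$$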

Since FP and FN (or, equivalently, TPR and FPR) depend on the value of our threshold parameter p, we can plot the cost as a function of p, that is, CST(p). It is instructive to plot cost and accuracy in the same diagram. To this end, we show cost not as an absolute value but as a percentage of the maximum possible cost (see Figure 5).
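A minimal sketch of this cost calculation, using the price tags of 20 € per false positive and 100 € per false negative from above. The ROC points and class sizes are made up, and the rescaling divides by the largest cost found along the curve, which is only one plausible reading of "maximum possible cost".

```python
# Price tags from the text: 12 minutes of expert time at 100 EUR per hour,
# and an assumed average of 100 EUR of unjustified items per missed case.
COST_FP = 100 * 12 / 60   # 20 EUR per false positive
COST_FN = 100.0           # EUR per false negative


def total_cost(tpr, fpr, p, n):
    """Total cost of all errors on a test set with P positives, N negatives."""
    false_positives = fpr * n
    false_negatives = (1 - tpr) * p
    return COST_FP * false_positives + COST_FN * false_negatives


def relative_costs(roc, p, n):
    """Cost at each ROC point as a fraction of the largest cost on the curve.

    Assumption: "maximum possible cost" is taken as the maximum along the
    given curve; other normalizations are conceivable.
    """
    costs = [total_cost(tpr, fpr, p, n) for fpr, tpr in roc]
    worst = max(costs)
    return [c / worst for c in costs]


# Hypothetical ROC points and class sizes.
roc = [(0.0, 0.0), (0.1, 0.6), (0.3, 0.85), (1.0, 1.0)]
P, N = 200, 800

rel = relative_costs(roc, P, N)
cheapest = roc[rel.index(min(rel))]
print([round(r, 2) for r in rel], cheapest)
```

With these toy numbers the cheapest operating point is (0.3, 0.85), which is a different point from the one with the highest accuracy on the same curve, (0.1, 0.6).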

**Figure 5.** Accuracy and cost plotted against the discrimination threshold. For reference, the full confusion matrix is shown in the lower right.

The operating point with optimal cost can also be determined geometrically from the ROC curve. Solving the relation between CST, TPR, and FPR for TPR, we note that

$$\mathrm{TPR} = 0.2 \cdot \frac{N}{P} \cdot \mathrm{FPR} + 1 - \frac{\mathrm{CST}}{100\,€ \cdot P}.$$

This means that points of equal cost are arranged on straight lines with slope 0.2⋅(N/P). These lines are known as *iso-performance lines* [3]. The cost increases with the error rate, that is, in the direction of the lower-right corner of the ROC diagram. Accordingly, the point of lowest cost is found where the ROC curve is tangent to the uppermost iso-performance line (see Figure 6).
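The geometric construction can be sketched in code as well. A line with slope 0.2⋅(N/P) through a ROC point (FPR, TPR) crosses the TPR axis at TPR − slope⋅FPR, so the point touching the uppermost iso-performance line is the one with the largest intercept. The ROC points and class sizes below are again invented for illustration.

```python
def iso_slope(cost_fp, cost_fn, p, n):
    """Slope of the iso-performance lines in the FPR-TPR plane."""
    return (cost_fp / cost_fn) * (n / p)


def tangent_point(roc, slope):
    """ROC point lying on the uppermost line with the given slope.

    The line through (fpr, tpr) has the intercept tpr - slope * fpr;
    the uppermost line is the one with the largest intercept.
    """
    return max(roc, key=lambda pt: pt[1] - slope * pt[0])


# Hypothetical ROC points and class sizes.
roc = [(0.0, 0.0), (0.1, 0.6), (0.3, 0.85), (1.0, 1.0)]
P, N = 200, 800

slope = iso_slope(cost_fp=20.0, cost_fn=100.0, p=P, n=N)
optimum = tangent_point(roc, slope)
print(round(slope, 2), optimum)
```

Because the intercept is a decreasing linear function of the cost, maximizing the intercept is the same as minimizing the cost directly.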

**Figure 6.** Our example’s ROC curve (blue) with several iso-performance lines (yellow), each corresponding to a constant cost. Cost increases perpendicular to the iso-performance lines, from upper left to lower right.

**Figure 7.** The cost difference between the accuracy optimum and the cost optimum can lead to significant waste.

We find the optimal cost of 29% at an accuracy level of 76%. In contrast, the cost at the maximum accuracy of 85% is 40%. Thus, raising accuracy by 9 percentage points raises the cost per case by 11 percentage points (see Figure 7). Operating the system at the significantly higher maximum accuracy effectively *destroys* your client’s money!

Accuracy and alternative scores such as F1 or the Matthews correlation coefficient have no value by themselves. Only when linked to actual costs, based on a thorough cost estimation, do these metrics become meaningful.

# The strategic decision

Now that the threshold value corresponding to the cost optimum has been found, should you configure the AI accordingly? That depends on how such a configuration fits in with the user’s broader strategy. Consider, for example, the different viewpoints of two companies operating in the insurance business.

The first company is a classical insurer. Their potential benefit from the AI is increased cost-efficiency of the claims validation process. The second company is a service, providing subject-matter expertise *to* insurers.

Many insurers have outsourced their expertise for various damage types to specialized companies such as CarExpert, ControlExpert, Dekra, Eucon, and others [5], paying them to identify unjustified claims. At first glance, it appears that an expert company could also simply apply the AI to maximize cost efficiency. However, they are bound by an additional constraint: Their clients do not care about unnecessary work on the expertise provider’s end (resulting from false positives), but they care *a lot* about lost money due to unjustified claims going undetected (resulting from false negatives). In terms of the metrics defined above, the insurer measures the provider’s service quality in terms of the TPR and is not willing to accept anything but a very high value. In our example, the cost optimum’s TPR is somewhere around 90%. From the insurer’s viewpoint, that value is unacceptable, as it would imply that 10% of the unjustified claims went unnoticed.

Therefore, instead of simply operating at the cost optimum, the provider has to strike a balance between cost efficiency and delivering a sufficiently high TPR. This leaves them with two basic choices. They can prioritize the reduction of their operations cost and, at the same time, reduce the price of their services to stay attractive on the market despite lower TPR. Or they can deliberately prioritize product quality over price, for example, to maintain the reputation of a high-quality service and aim for long-term customer satisfaction. If they go for the latter choice, the Machine Learning solution has to be operated at a very high TPR (see Figure 8). This way, the provider can benefit from a moderate automation rate while maintaining a very low error rate, close to human-level performance.

On the downside, operating at high TPR leads to an increased cost per case and a reduced accuracy, in comparison to operating at the cost optimum. Practically all of this accuracy loss can be attributed to false positives, following the logic that more work for the human experts is acceptable, as long as most of the unjustified claims are found.
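The quality-first choice can be expressed as a constrained version of the cost optimization: among all operating points that reach a required TPR, pick the cheapest. The following sketch assumes a hypothetical minimum TPR of 95% and made-up ROC points and class sizes; the price tags are the 20 € and 100 € from the cost estimate.

```python
def error_cost(fpr, tpr, p, n, cost_fp=20.0, cost_fn=100.0):
    """Total cost of false positives and false negatives in EUR."""
    return cost_fp * fpr * n + cost_fn * (1 - tpr) * p


def cheapest_point(roc, p, n, min_tpr=0.0):
    """Cheapest (FPR, TPR) point whose TPR is at least min_tpr, or None."""
    feasible = [pt for pt in roc if pt[1] >= min_tpr]
    if not feasible:
        return None
    return min(feasible, key=lambda pt: error_cost(pt[0], pt[1], p, n))


# Hypothetical ROC points and class sizes.
roc = [(0.0, 0.0), (0.1, 0.6), (0.3, 0.85), (0.6, 0.97), (1.0, 1.0)]
P, N = 200, 800

cost_optimum = cheapest_point(roc, P, N)
quality_first = cheapest_point(roc, P, N, min_tpr=0.95)
print(cost_optimum, error_cost(*cost_optimum, P, N))
print(quality_first, error_cost(*quality_first, P, N))
```

In this toy example, insisting on a TPR of at least 95% moves the operating point from (0.3, 0.85) to (0.6, 0.97) and raises the error cost, almost all of it from additional false positives. Which minimum TPR to require is exactly the kind of input the code cannot supply.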

It is not possible to mathematically prove which choice is the right one. How to position yourself in the market is a decision beyond any formula. And the difficulties do not stop here. There is a long list of other factors to be taken into account, such as company strategy, compliance and regulatory issues, political and ethical considerations, and many more. What all these points have in common is that they are hard – if at all – to quantify. These difficulties cannot be resolved by mathematics alone and it takes strong and smart leadership to find the right way in this complex situation. This is the moment where experienced executives (and consultants at their side) can shine.

**Figure 8.** Prioritizing TPR reduces false negatives but costs both money and accuracy.

# Conclusion

Reality is not as clear cut as a Kaggle competition and an AI’s performance in real-world applications cannot simply be auto-scored. Success along the strategic and human dimensions of the problem is beyond formulas. In this sense, AI is no different from any other major change in business operations. Implementing AI in real-world settings is a process involving a multitude of stakeholders who evaluate the success or failure of a solution based on very different metrics that often cannot be mathematically formalized. Indirect costs, resulting, for example, from dissatisfied employees or customers, can show up with months or years of delay, making the associated risks hard to quantify. For these reasons, technical performance metrics will often play an important but ultimately secondary role in the change management process.

(A German translation of this post is available under the title *Die optimale KI-Konfiguration ist eine Frage der Perspektive*.)

# References

[1] TEAM POWER, C€-Profile 2019, p. 21, retrieved May 30, 2020

[2] T. Fawcett, An introduction to ROC analysis (2006), *Pattern Recognition Letters*, **27**, pages 861–874

[3] F. Provost and T. Fawcett, Robust Classification for Imprecise Environments (2001), *Machine Learning*, **42**, pages 203–231 (also available on arχiv)

[4] Adult dataset (1996), provided through Dua, D. and Graff, C. (2019), UCI Machine Learning Repository

[5] K. Braunwarth, Wertorientiertes Prozessmanagement von Dienstleistungsprozessen (2009), Doctoral thesis, section II.B1-8