## Two ways to visually analyze a binary classifier.

Recently, while writing an article on how to optimally configure a binary classifier involving, among other factors, ROC analysis, I came across a promising alternative known as TOC analysis. In this post, I will introduce you to the concepts behind both visual tools and discuss their similarities and differences.

# Motivation

When it comes to evaluating the performance of binary predictors, the receiver operating characteristic (ROC) curve has been a staple for decades and actually long predates the Machine Learning era (for an early example from the 1950s, see Peterson et al. [1]). It allows us to evaluate and compare the performance of classifiers and is, therefore, a useful tool for model selection.

That a classifier can at all be fully characterized by a curve drawn in two-dimensional space, that is, on your monitor or a piece of paper, is based on the fact that the two-by-two confusion matrix has only two degrees of freedom for any given test set. Each point on a ROC curve determines the corresponding confusion matrix if and only if you know your test set’s composition, that is, you know how many positive and negative samples it contains. However, the ROC graph itself does not contain the composition as visual information.

Despite the long history of ROC analysis, this shortcoming was addressed only fairly recently, with the introduction of the total operating characteristic (TOC) by Pontius and Si in 2014 [2]. A TOC diagram contains the full ROC information and additionally allows you to read off the total information, that is, the test set’s composition and all four entries of the confusion matrix, for each point on the curve.

Before we see how that works, let us quickly review the basic concepts and notation of binary classification.

# Binary classification basics

In binary classification, model performance is commonly evaluated by comparing the model’s predictions for data points from the test set to the known correct results, that is, to the labels of your test set. The prediction for each data point falls into one of four categories: true positive (TP), true negative (TN), false positive (FP), or false negative (FN). The number of samples in each category is represented in the *confusion matrix*:

| | Predicted positive | Predicted negative |
|---|---|---|
| **Actually positive** | TP | FN |
| **Actually negative** | FP | TN |

Although the confusion matrix has 4 components, only two are actually independent. This is because each actual test set has a certain fixed number P of samples that are actually positive and a certain number N of samples that are actually negative. Therefore, irrespective of a particular classifier’s performance, in the end, we always have TP + FN = P and TN + FP = N.

That makes two equations with four unknowns and leaves us with two degrees of freedom. A common way (but not the only one) to parameterize these degrees of freedom is by means of the *true positive rate* (TPR) and the *false positive rate* (FPR). Mathematically, they are defined as

$$\mathrm{TPR} = \frac{\mathrm{TP}}{P}, \qquad \mathrm{FPR} = \frac{\mathrm{FP}}{N}.$$

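Starting from these definitions, you can easily convince yourself with very little algebra that TPR and FPR (together with P and N) are enough to determine all four of TP, FP, FN, and TN, by noting that

$$\mathrm{TP} = \mathrm{TPR} \cdot P, \quad \mathrm{FN} = (1 - \mathrm{TPR}) \cdot P, \quad \mathrm{FP} = \mathrm{FPR} \cdot N, \quad \mathrm{TN} = (1 - \mathrm{FPR}) \cdot N.$$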
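To make this concrete, here is a minimal Python sketch that recovers the four confusion matrix entries from TPR and FPR once P and N are known. The function name `confusion_from_rates` and the example numbers are my own choices for illustration.

```python
def confusion_from_rates(tpr, fpr, p, n):
    """Recover (TP, FP, FN, TN) from TPR, FPR and the test set composition."""
    tp = tpr * p      # TPR = TP / P
    fn = p - tp       # TP + FN = P
    fp = fpr * n      # FPR = FP / N
    tn = n - fp       # TN + FP = N
    return tp, fp, fn, tn


# Hypothetical test set with 80 positives and 20 negatives
tp, fp, fn, tn = confusion_from_rates(tpr=0.75, fpr=0.25, p=80, n=20)
print(tp, fp, fn, tn)
```

Note that the same pair (TPR, FPR) yields a different confusion matrix for every choice of P and N, which is exactly the information a ROC graph does not show.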

In many cases, classifiers are based on probabilistic classification. This means they compute probabilities for each class from the features of a data point x. In the case of binary classification, these are the two probabilities p(1|x) and p(0|x). Since the sample has to be in one of the two classes, that is, p(1|x)+p(0|x)=1, it suffices to look at one of the two probabilities, say, p(1|x). To obtain a binary prediction, a discrimination threshold t is used to discretize the continuous probability p(1|x) into one of the two classes:

$$\hat{y}(x) = \begin{cases} 1 & \text{if } p(1|x) \geq t, \\ 0 & \text{otherwise.} \end{cases}$$

Thus, TPR and FPR, and with them all four numbers in the confusion matrix, depend on the threshold value. This dependence is what both ROC and TOC graphs are designed to visualize.
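The following sketch illustrates this threshold dependence on a tiny made-up data set. It assumes that a sample is predicted positive when p(1|x) is at least the threshold. The helper `confusion_counts` and the toy labels and probabilities are hypothetical and not taken from the classifier discussed in this post.

```python
def confusion_counts(labels, probs, threshold):
    """Count (TP, FP, FN, TN) for a given discrimination threshold."""
    tp = fp = fn = tn = 0
    for y, prob in zip(labels, probs):
        predicted = 1 if prob >= threshold else 0  # assumed convention: >=
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Toy data: true labels and predicted probabilities p(1|x)
labels = [1, 1, 1, 0, 1, 0, 0, 0]
probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

for threshold in (0.5, 0.25):
    print(threshold, confusion_counts(labels, probs, threshold))
```

Lowering the threshold moves samples from the negative to the positive predictions, so TP and FP can only grow while FN and TN can only shrink.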

# ROC and TOC graphs

Both ROC and TOC graphs are tools to visualize classifier performance for all possible threshold choices in a single graph. However, they are based on two different coordinate systems.

A ROC curve is plotted into an FPR-TPR coordinate system, that is, you plot (FPR(threshold), TPR(threshold)) for all threshold values between 0% and 100%. A TOC graph, on the other hand, is plotted into a (TP+FP)-TP coordinate system, that is, you plot the points (TP(threshold) + FP(threshold), TP(threshold)) for each threshold value.
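As a rough sketch of how both sets of points could be computed, the function below sweeps the threshold over all distinct predicted probabilities. The name `roc_and_toc_points`, the toy data, and the extra starting threshold above every probability (so that both curves start at the origin) are assumptions of this sketch. It also assumes that the test set contains at least one positive and one negative sample.

```python
def roc_and_toc_points(labels, probs):
    """Return the ROC points (FPR, TPR) and TOC points (TP+FP, TP)."""
    p = sum(labels)            # number of actual positives
    n = len(labels) - p        # number of actual negatives
    # Start above every score (nothing predicted positive), then lower
    # the threshold to each distinct score in turn.
    thresholds = [float("inf")] + sorted(set(probs), reverse=True)
    roc, toc = [], []
    for t in thresholds:
        tp = sum(1 for y, s in zip(labels, probs) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(labels, probs) if s >= t and y == 0)
        roc.append((fp / n, tp / p))
        toc.append((tp + fp, tp))
    return roc, toc


labels = [1, 1, 1, 0, 1, 0, 0, 0]
probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
roc, toc = roc_and_toc_points(labels, probs)
print(roc)
print(toc)
```

The two lists describe the same sequence of operating points. Only the coordinates differ: rates for ROC, counts for TOC.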

The TOC graph’s main appeal is that you can read off the complete confusion matrix for every point in TOC space. This is achieved by not only plotting the curve, but also a surrounding parallelogram-shaped box with the corners (0, 0), (N, 0), (N+P, P), (P, P).

**Figure 1:** The ROC curves of two classifiers (blue and green lines) and the curve representing uninformed (random) classifiers (orange dots).

**Figure 2:** For every point on a TOC curve, you can read off all four components, TP, FP, FN, and TN, of the confusion matrix, while having all information from a ROC graph available, as well.

We will discuss the nature of this parallelogram in more detail below, but for now, we note that the horizontal distance from any point on the TOC curve to the box’s left boundary corresponds to FP, the horizontal distance to the right boundary to TN, the vertical distance to the top to FN, and the vertical distance to the bottom to TP.
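Reading off these four distances can be sketched in a few lines of Python. The function name `confusion_from_toc_point` and the example point are hypothetical. The code assumes the point (x, y) = (TP+FP, TP) lies inside the parallelogram with corners (0, 0), (N, 0), (N+P, P), and (P, P).

```python
def confusion_from_toc_point(x, y, p, n):
    """Read (TP, FP, FN, TN) off a TOC point (x, y) = (TP+FP, TP)."""
    tp = y              # vertical distance to the bottom boundary (y = 0)
    fn = p - y          # vertical distance to the top boundary (y = P)
    fp = x - y          # horizontal distance to the left boundary (x = y)
    tn = (y + n) - x    # horizontal distance to the right boundary (x = y + N)
    return tp, fp, fn, tn


# Hypothetical TOC point (4, 3) for a test set with P = 4 and N = 4
print(confusion_from_toc_point(4, 3, p=4, n=4))
```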

Unlike a ROC graph, which does not allow you to reconstruct the confusion matrix without knowledge of P and N (neither of which is contained in the graph itself), a TOC graph contains the confusion matrix *for every given threshold choice*. Furthermore, you can easily read off the test set’s size and composition, revealing, for example, skewness in your data that would remain “hidden” in a ROC graph.

# ROC space vs. TOC space

The ROC curve is bound to a square region known as ROC space, whose points correspond to all possible values of TPR and FPR between 0 and 1 [3]. This square is cut in half by the diagonal from TPR=FPR=0 to TPR=FPR=1. The points on this diagonal represent so-called uninformed classifiers. Such a classifier simply categorizes data points at random, completely disregarding their actual feature values. An uninformed classifier at TPR=FPR=0.7, for example, would classify 7 out of 10 data points as positive and 3 out of 10 as negative.

The extreme ends of this diagonal are home to two very special deciders. In the lower-left, at TPR=FPR=0, we have a classifier that simply categorizes every single data point as negative. The classifier across, in the upper-right corner of ROC space, categorizes every single data point as positive.

**Figure 3:** A ROC curve (blue) in comparison to the performance of uninformed (random) classifiers (orange diagonal). The four extreme corners of ROC space are: (1) the perfect classifier, (2) the all-positive classifier, (3) the worst possible classifier, and (4) the all-negative classifier.

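The other two corners of the ROC space are equally interesting. In the upper-left, we have the perfect classifier. It assigns the correct category to every single data point of the test set, corresponding to a diagonal confusion matrix with TP=P, TN=N, and FP=FN=0, that is,

$$\begin{pmatrix} \mathrm{TP} & \mathrm{FN} \\ \mathrm{FP} & \mathrm{TN} \end{pmatrix} = \begin{pmatrix} P & 0 \\ 0 & N \end{pmatrix}.$$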

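You could render the perfect classifier perfectly imperfect by flipping each of its responses. This worst possible classifier assigns the wrong category to each data point, resulting in the confusion matrix TP=TN=0, FP=N, and FN=P, that is,

$$\begin{pmatrix} \mathrm{TP} & \mathrm{FN} \\ \mathrm{FP} & \mathrm{TN} \end{pmatrix} = \begin{pmatrix} 0 & P \\ N & 0 \end{pmatrix}.$$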

In ROC space, this corresponds to the lower-right corner, TPR=0, FPR=1.

TOC space differs from ROC space in two fundamental ways. First, it is not a square but a parallelogram, and, second, its shape depends on the test set’s composition. This is a direct consequence of the choice of axes because unlike the rates TPR and FPR, the absolute numbers TP and FP are not confined to the range from 0 to 1. To understand the parallelogram, choose any fixed value of TP. Your choice is then automatically a lower bound of TP+FP, the TOC graph’s abscissa (aka x-axis), since FP is a count and thus never negative. The upper bound of TP+FP is determined by your choice, as well. It is TP+N, as FP cannot be greater than N. These lower and upper bounds of TP+FP form the inclined left and right boundaries of the parallelogram. TP itself is also a count and limited to the range from 0 to P, determining the lower and upper boundaries of the parallelogram, respectively. Consequently, the whole TOC space can be embedded in an N+P by P rectangle.
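The bounds derived above translate directly into a membership test for TOC space. The following is a small sketch with hypothetical helper names `toc_x_bounds` and `in_toc_space`.

```python
def toc_x_bounds(tp, n):
    """Lower and upper bound of TP+FP for a fixed TP, given N negatives."""
    return tp, tp + n   # FP ranges from 0 to N


def in_toc_space(x, y, p, n):
    """Check whether the point (x, y) = (TP+FP, TP) lies in TOC space."""
    if not 0 <= y <= p:          # TP is limited to the range from 0 to P
        return False
    lower, upper = toc_x_bounds(y, n)
    return lower <= x <= upper


# Hypothetical test set with P = 4 positives and N = 4 negatives
print(in_toc_space(4, 3, p=4, n=4))  # inside the parallelogram
print(in_toc_space(2, 3, p=4, n=4))  # would require a negative FP
print(in_toc_space(8, 3, p=4, n=4))  # would require FP > N
```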

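Despite all differences, TOC space is just a (non-uniformly) scaled and sheared version of ROC space. Any point in ROC space can be mapped to TOC space using the linear transformation

$$\phi: \begin{pmatrix} \mathrm{FPR} \\ \mathrm{TPR} \end{pmatrix} \mapsto \begin{pmatrix} \mathrm{TP} + \mathrm{FP} \\ \mathrm{TP} \end{pmatrix} = \begin{pmatrix} N & P \\ 0 & P \end{pmatrix} \begin{pmatrix} \mathrm{FPR} \\ \mathrm{TPR} \end{pmatrix}.$$

This transformation is a composition of scaling

$$\phi_{\mathrm{scale}}: \begin{pmatrix} \mathrm{FPR} \\ \mathrm{TPR} \end{pmatrix} \mapsto \begin{pmatrix} N & 0 \\ 0 & P \end{pmatrix} \begin{pmatrix} \mathrm{FPR} \\ \mathrm{TPR} \end{pmatrix} = \begin{pmatrix} \mathrm{FP} \\ \mathrm{TP} \end{pmatrix}$$

and shearing

$$\phi_{\mathrm{shear}}: \begin{pmatrix} \mathrm{FP} \\ \mathrm{TP} \end{pmatrix} \mapsto \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} \mathrm{FP} \\ \mathrm{TP} \end{pmatrix} = \begin{pmatrix} \mathrm{TP} + \mathrm{FP} \\ \mathrm{TP} \end{pmatrix}.$$

Therefore, we can go from ROC space to TOC space by use of

$$\phi = \phi_{\mathrm{shear}} \circ \phi_{\mathrm{scale}}.$$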
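The two steps can be sketched as two small Python functions and their composition. The names `scale`, `shear`, and `roc_to_toc` as well as the values P = 3 and N = 5 are hypothetical choices for illustration.

```python
def scale(fpr, tpr, p, n):
    """Scaling step: turn the rates (FPR, TPR) into the counts (FP, TP)."""
    return n * fpr, p * tpr


def shear(fp, tp):
    """Shearing step: turn (FP, TP) into the TOC coordinates (TP+FP, TP)."""
    return fp + tp, tp


def roc_to_toc(fpr, tpr, p, n):
    """Map a point in ROC space to TOC space: scale first, then shear."""
    return shear(*scale(fpr, tpr, p, n))


# Map the four corners of ROC space for a test set with P = 3 and N = 5
corners = {
    "all-negative": (0, 0),
    "perfect": (0, 1),
    "all-positive": (1, 1),
    "worst": (1, 0),
}
for name, (fpr, tpr) in corners.items():
    print(name, roc_to_toc(fpr, tpr, p=3, n=5))
```

The corners of the unit square land on (0, 0), (P, P), (N+P, P), and (N, 0), the corners of the parallelogram.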

**Figure 4:** The four extreme corners of TOC space are (1) the perfect classifier, (2) the all-positive classifier, (3) the worst possible classifier, and (4) the all-negative classifier. At point (5), the prevalence (number of positives) is estimated correctly. Left of (5) (gray area), TP+FP<P, so that the prevalence is underestimated. Right of (5), TP+FP>P, so that the prevalence is overestimated. (Source: The author)

Due to the simple nature of this transformation, the essential geometric structure of ROC space is carried over to TOC space. Thus, just like ROC space, TOC space is also halved by the diagonal representing uninformed classifiers. The corners of TOC space preserve their meaning from ROC space, as well: perfect classifier in the upper left, worst possible classifier in the lower right, all-positive classifier in the top right, and all-negative classifier in the lower left (see Figure 4).

Other derived properties stay intact under the transformation, as well. Let’s consider, for example, the area under the ROC curve (AUROCC), a typical metric to describe the ROC curve as a whole using a single number. The area under the TOC curve (AUTOCC), measured between the curve and the lower boundary of the parallelogram, is directly proportional to AUROCC and can be interpreted the same way. We have

$$\mathrm{AUTOCC} = N \cdot P \cdot \mathrm{AUROCC},$$

so that the fraction of ROC space (the 1-by-1 square) under the ROC curve is the same as the fraction of TOC space (the parallelogram with base N and height P) under the TOC curve.

There is a lengthy proof of this relation in the original publication of Pontius and Si [2], but I find it more straightforward to consider that only the scaling $\phi_{\mathrm{scale}}$ affects the area (the shearing $\phi_{\mathrm{shear}}$ is area-preserving) and that its Jacobian determinant is N⋅P, so that it is evident that AUROCC is enlarged accordingly under the transformation.
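If you prefer a numerical check over an argument, the relation can be verified on a toy curve. The sketch below uses the trapezoidal rule on a hand-written, hypothetical ROC curve with P = N = 4. It takes the area under the TOC curve inside the parallelogram, which means subtracting the triangle of area P²/2 below the right boundary from the area down to the x-axis.

```python
def trapezoid_area(points):
    """Area between a piecewise-linear curve and the horizontal axis."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


P, N = 4, 4
# Hypothetical ROC curve, given as (FPR, TPR) points
roc = [(0.0, 0.0), (0.0, 0.25), (0.0, 0.5), (0.0, 0.75), (0.25, 0.75),
       (0.25, 1.0), (0.5, 1.0), (0.75, 1.0), (1.0, 1.0)]
# Same curve in TOC coordinates (TP+FP, TP)
toc = [(N * fpr + P * tpr, P * tpr) for fpr, tpr in roc]

auroc = trapezoid_area(roc)
# The area down to the x-axis also covers the triangle below the right
# boundary, which lies outside TOC space and has area P*P/2.
autoc = trapezoid_area(toc) - P * P / 2

print(auroc, autoc / (N * P))
```

Both printed numbers agree, as expected from the area scaling by N⋅P.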

There is one noteworthy point in TOC space that cannot be identified in ROC space. It is the point (P, TP), shown as point (5) in Figure 4. It is the point on the curve directly under the upper end of the left boundary of the parallelogram. A classifier operating at this point produces precisely the same ratio of positive to negative predictions as there are positive and negative samples in the test set. Such a classifier can be said to correctly represent the actual prevalence. All classifiers left of this point are underestimating the ratio of positives while all classifiers to its right are overestimating it.
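Locating this point on a discrete TOC curve can be sketched as follows. The function `prevalence_point` and the toy TOC points are hypothetical. For real data with tied scores, the curve may step over x = P without hitting it exactly, in which case this simple lookup returns `None`.

```python
def prevalence_point(toc_points, p):
    """Return the first TOC point whose x-coordinate TP+FP equals P."""
    for x, y in toc_points:
        if x == p:
            return x, y
    return None


# Hypothetical TOC curve for a test set with P = 4 and N = 4
toc = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 3), (5, 4), (6, 4), (7, 4), (8, 4)]
P = 4

point = prevalence_point(toc, P)
tp = point[1]
fp = point[0] - tp
fn = P - tp
print(point, fp, fn)
```

At this point, TP+FP=P=TP+FN, so the number of false positives equals the number of false negatives.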

**Figure 5:** The fraction of the parallelogram under the TOC curve (blue area) compared to the whole parallelogram’s area (blue+orange part) equals the area under the ROC curve. Compare the colored areas in the ROC graph shown in Figure 3.

# The shape of TOC space

While ROC space remains geometrically static under variation of test set composition, TOC space changes strongly. Pontius and Si propose “to enhance visual clarity” by rescaling the graph so that on paper (or your monitor), both the TP axis and the TP+FP axis have the same length [2].

However, it helps to understand the concept of TOC space to draw the TP and TP+FP-axes with the same scale and watch how TOC space unfolds and collapses with changing test set composition.

For any non-empty test set, there are five possible situations:

1. 0=N<P

2. 0<N<P

3. 0<P=N

4. 0<P<N

5. 0=P<N

For illustration, let us consider a toy classifier trained on the “Adult” data set [4] and compile a master test set with equally many positive and negative samples. We can then create all five possible situations by controlled subsetting of this set. In particular, we start at N=0, P=1552 (situation 1), increase N (situation 2) until we reach N=P=1552 (situation 3). From there, we decrease P (situation 4) until we eventually reach P=0, N=1552 (situation 5). Figure 6 illustrates how TOC space changes its shape due to these changes in test set composition.

**Figure 6:** Animation showing the shape of TOC space for various test set compositions. Note the collapse from two to one dimension in the two extreme cases N=0 and P=0.

**Figure 7:** For 0=N<P (Situation 1), TOC space is one-dimensional. It corresponds to the diagonal from (0, 0) to (P, P).

Let us discuss situations 1 through 5 in more detail.

For 0=N<P, TOC space is one-dimensional! Its left and right boundaries collapse into a single diagonal line from (0, 0) to (P, P) (see Figure 7). Although this might seem strange at first glance, it is sensible, since in the absence of negatives, the performance of a classifier is in fact one-dimensional, completely determined by a single parameter, its TPR.

Going away from this extreme situation, we increase the number of negatives in the test set. The diagonal separates into a left and right boundary, opening up a finite area, the “normal” two-dimensional TOC space.

However, TOC space retains the shape of a rather narrow diagonal strip, because its right boundary starts at TP+FP=N, far to the left of TP+FP=P, where its left boundary ends (see Figure 8).

**Figure 8:** For 0<N<P (Situation 2), TOC space is a rather narrow diagonal strip.

**Figure 9:** For 0<N=P the left boundary ends directly above the starting point of the right boundary.

Increasing the proportion of negatives further, we then approach the situation where P=N, that is, negatives and positives are perfectly balanced in the test set. Now, the right TOC space boundary starts directly underneath the endpoint of its left boundary (see Figure 9).

When we increase the number of negatives beyond this point, we come to a situation where the left and right boundary diagonals are no longer above each other and there is a rectangular area between (P, 0), (N, 0), (N, P), and (P, P). Consequently, TOC space now appears rather broad (see Figure 10).

Eventually, we approach another extreme: The whole test set is filled exclusively with negatives. Again, two boundaries of TOC space collapse to a single line. This time, it’s the upper and lower boundaries, so that TOC space becomes a horizontal line segment reaching from 0 to N (see Figure 11). Again, this collapse is sensible, as a classifier’s result is now purely made up of TN and FP and its performance is completely defined by a single parameter, its FPR.

We conclude our round-trip through TOC space by noting that for the (practically irrelevant but pathologically interesting) case of a completely empty test set with N=P=0, TOC space would collapse even further, becoming a single point.
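The unfolding and collapsing of TOC space can also be followed numerically by computing the corners and the area of the parallelogram for each situation. In the sketch below, the helper names are hypothetical, and the intermediate values of 500 for situations 2 and 4 are arbitrary choices of mine, not the subset sizes used for the figures.

```python
def toc_corners(p, n):
    """Corners of TOC space: all-negative, worst, all-positive, perfect."""
    return [(0, 0), (n, 0), (n + p, p), (p, p)]


def toc_area(p, n):
    """Area of the parallelogram: base N times height P."""
    return n * p


# (P, N) for the five situations; 500 is an arbitrary illustrative value
situations = {
    "1: 0=N<P": (1552, 0),
    "2: 0<N<P": (1552, 500),
    "3: 0<P=N": (1552, 1552),
    "4: 0<P<N": (500, 1552),
    "5: 0=P<N": (0, 1552),
}
for name, (p, n) in situations.items():
    print(name, toc_corners(p, n), toc_area(p, n))
```

In situations 1 and 5, pairs of corners coincide and the area vanishes, which is the collapse to one dimension described above.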

**Figure 10:** For 0<P<N, TOC space is rather broad.

**Figure 11:** For 0=P<N (Situation 5), TOC space is a one-dimensional line segment from 0 to N.

# Discussion

The TOC graph is a useful addition to the statistician’s toolbox. Readers used to ROC analysis can learn to interpret a TOC diagram quickly because both representations share many properties.

A TOC curve contains strictly more information than its ROC counterpart. However, the information sparseness of a ROC graph can also be seen as advantageous. As Fawcett puts it in his *Introduction to ROC analysis* [3]: “*One advantage of ROC graphs is that they enable visualizing and organizing classifier performance without regards to class distributions or error costs. This ability becomes very important when investigating learning with skewed distributions or cost-sensitive learning. A researcher can graph the performance of a set of classifiers, and that graph will remain invariant with respect to the operating conditions (class skew and error costs). As these conditions change, the region of interest may change, but the graph itself will not.*”

Therefore, ROC curves will remain relevant for certain purposes where TOC is not a suitable substitute.

When it comes to library support, the ROC curve has a clear advantage. While there are ready-to-use library functions for ROC analysis in any major Machine Learning or statistics library, the support for TOC is still rather limited, due to the relatively short time since its inception. There is a TOC R package curated by the authors of the original TOC publication. However, if you are willing to go the extra mile, a function to plot TOC diagrams from probabilities is quickly implemented in your favorite programming language.

Finally, one has to consider that graphs are, in the end, a means of communication. In the statistics and Machine Learning communities, ROC analysis is a well-known standard, and ROC graphs are directly understood. In contrast, TOC analysis is still relatively young and might still perplex part of your audience. If you have a 10-minute time slot at an important conference, you would not want to spend 5 minutes explaining an unusual diagram type (TOC), in particular when there is an alternative (ROC) that your audience will get right away.

Although TOC analysis is still far away from being widely adopted, I hope this post convinced you that it is a tool worth trying out when analyzing a classifier’s performance.

# Comments?

If you have any comments or remarks, feel free to add them to the original posting on Medium.

# References

[1] W. Peterson, T. Birdsall, and W. Fox, *The Theory of Signal Detectability* (1954), Transactions of the IRE Professional Group on Information Theory **4** (4), pages 171–212.

[2] R. G. Pontius Jr and K. Si, *The total operating characteristic to measure diagnostic ability for multiple thresholds* (2014), International Journal of Geographical Information Science **28** (3), pages 570–583.

[3] T. Fawcett, *An introduction to ROC analysis* (2006), Pattern Recognition Letters **27**, pages 861–874.

[4] *Adult dataset* (1996), provided through Dua, D. and Graff, C. (2019), UCI Machine Learning Repository.

### About the author

Dr. Michael Köpf, IT organizational consultant and transformation manager

As a passionately analytical organizational consultant, I support complex IT projects with several hundred participants. I advise top management. Thanks to several years of experience as a developer, team lead, and software architect, I also know what drives the working level. That makes me effective, because real change runs through all levels.