Mysteries of the ROC curve
Most data scientists have come across the ROC (receiver operating characteristic) curve. It is often used to evaluate the performance of a binary classifier by measuring the AUC (area under the curve) and to find a probability threshold that hits the sweet spot in the sensitivity-specificity tradeoff. However, both theoretically and practically, there is more to it than meets the eye. This article doesn’t deal with the very basics of the ROC curve; rather, it addresses some of its underlying mysteries through the following precise questions:
- Why does it make sense to plot an ROC curve with Y-axis as sensitivity and X-axis as 1-specificity?
- Is the ROC curve appropriate for multi-class settings?
- Since in multi-class settings it is more reasonable to measure sensitivity (aka recall) and precision instead of specificity, what if we plot a similar curve between recall and 1-precision?
The answer to each question builds on the answer to the previous one, so readers are encouraged to follow the flow of the article.
Let’s start with some notation and definitions first.
- TP = True Positives
- FP = False Positives
- TN = True Negatives
- FN = False Negatives
- Sensitivity = Recall = TPR (True Positive Rate) = TP/(TP+FN)
- 1-Specificity = FPR (False Positive Rate) = FP/(FP+TN)
- 1-Precision = FP/(TP+FP)
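To make the definitions concrete, here is a minimal sketch (the function name and the example counts are made up for illustration) that computes the three quantities from raw confusion-matrix counts:

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute the three quantities defined above from raw counts."""
    sensitivity = tp / (tp + fn)            # a.k.a. recall, TPR
    one_minus_specificity = fp / (fp + tn)  # a.k.a. FPR
    one_minus_precision = fp / (tp + fp)
    return sensitivity, one_minus_specificity, one_minus_precision

# Illustrative counts: 80 TP, 10 FP, 90 TN, 20 FN
sens, fpr, one_minus_prec = binary_metrics(80, 10, 90, 20)
print(sens, fpr, one_minus_prec)  # 0.8 0.1 0.111...
```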
Let’s draw a confusion matrix, which may help the readers in understanding the later analysis:
- Why is ROC curve plotted with Y-axis as sensitivity and X-axis as 1-specificity?
While plotting an ROC curve, we vary the probability threshold for the positive class and measure sensitivity and 1-specificity on a data set at each chosen threshold. In the following figure, instances falling to the right of the threshold are predicted as the positive class and those falling to the left as the negative class:
Now we ask: what else varies when we vary the probability threshold? Say we increase the threshold. If the threshold is increased sufficiently, some data points that were earlier predicted as the positive class will now be predicted as the negative class. So the count of positive predictions decreases, which means TP decreases, FP decreases, or both. How does this affect sensitivity? The numerator of sensitivity is TP, so the numerator may decrease. What happens to the denominator? The denominator is TP+FN, which is in fact the total count of positive data points as per the ground truth. Are we changing the ground truth? No! So the denominator does not change. Thus, on increasing the threshold, sensitivity either decreases or stays the same. Now consider 1-specificity. Its numerator is FP, which again may decrease, and its denominator is FP+TN, the count of negative data points as per the ground truth, which likewise remains unchanged. Thus, on increasing the threshold, 1-specificity either decreases or stays the same. This gives the ROC curve a few interesting properties: it is never the case that sensitivity increases while 1-specificity decreases, and vice versa is also impossible. This causes an ROC curve to look like:
Note that the boxes with arrows pointing to different regions of the curve contain the threshold values. One can see that the curve never has a negative slope. Another interesting property is that the ideal scenario occurs when both sensitivity and specificity are 1, i.e. when the topmost-leftmost point lies on the curve. This ideal case can be achieved in only one way: the curve starts at the origin, goes vertically straight up until sensitivity reaches 1, and then goes horizontally right until 1-specificity reaches 1. This leads to the concept of area under the curve, which in the perfect case is 1; it follows that good classifiers have an AUC close to 1. It also suggests that, to maintain a good sensitivity-specificity tradeoff, we should choose a threshold whose point on the ROC curve is closest to the topmost-leftmost corner. Now let’s turn to the second question.
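The threshold sweep described above can be sketched in a few lines of NumPy. Everything here is illustrative: the data are made up, and `roc_points` is a hypothetical helper, not a library function. The sketch also shows the two claims made above, that both coordinates can only shrink as the threshold rises, and that the best threshold is the one whose point lies closest to the top-left corner (0, 1):

```python
import numpy as np

def roc_points(y_true, y_score, thresholds):
    """(1-specificity, sensitivity) pair for each threshold."""
    points = []
    for t in thresholds:
        y_pred = (y_score >= t).astype(int)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        tn = np.sum((y_pred == 0) & (y_true == 0))
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points

# Made-up scores: 4 negatives, 4 positives
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.3, 0.4, 0.6, 0.5, 0.7, 0.8, 0.9])
thresholds = np.linspace(0, 1, 11)
pts = roc_points(y_true, y_score, thresholds)

# Threshold whose ROC point is closest to the top-left corner (0, 1):
best_t, best_pt = min(
    zip(thresholds, pts),
    key=lambda tp_: tp_[1][0] ** 2 + (1 - tp_[1][1]) ** 2,
)
print(best_t, best_pt)
```

Walking down `pts`, both coordinates are non-increasing as the threshold grows, which is exactly why the ROC curve never has a negative slope.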
- Is ROC curve appropriate for multi-class settings?
Let’s ask a simple question: how does specificity behave in a multi-class setting? Consider a thought experiment with a 5-class classifier that has achieved a sensitivity of 92% for each of the 5 classes, and assume the remaining 8% of errors is randomly distributed across the confusion matrix. If the ground truth has 100 labels in each class, we can expect a confusion matrix of the following type:
What is the specificity for Class1? Note that since every class other than Class1 becomes the negative class, specificity = TN/(TN+FP) = 396/(396+4) = 99%.
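This computation generalizes to any class of a multi-class confusion matrix. Below is a sketch with a hypothetical matrix in the spirit of the thought experiment: 100 ground-truth labels per class, 92 on the diagonal, and each row’s 8 errors spread evenly over the other four classes (one possible way of distributing them randomly; the exact specificity depends on how the errors land, 98% here versus 99% in the matrix above):

```python
import numpy as np

# 5x5 confusion matrix: rows = ground truth, columns = predictions.
cm = np.full((5, 5), 2)    # each row's 8 errors, 2 per wrong class
np.fill_diagonal(cm, 92)   # 92 correct predictions per class

def specificity(cm, k):
    """Specificity of class k, treating all other classes as negative."""
    fp = cm[:, k].sum() - cm[k, k]  # predicted k but truly another class
    tn = cm.sum() - cm[k, :].sum() - cm[:, k].sum() + cm[k, k]
    return tn / (tn + fp)

print(specificity(cm, 0))  # 392/400 = 0.98, despite only 92% sensitivity
```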
Because TN >> FP, it is very easy to get high specificities in multi-class settings, and this effect of large specificities co-occurring with small sensitivities becomes more pronounced as the number of classes grows. Now the question is how this affects the ROC curve. Since specificities are large most of the time, the curve starts at the origin and tends to move vertically upwards (because 1-specificity is really low), and, as we discussed in the answer to question 1, the curve can only go up. This gives the misleading impression that the classifier is performing better than it is in reality. See the following example of an ROC curve from a 9-class classifier:
In the above case, the class in question has a sensitivity of 95% and a precision of 94% at the 0.5 threshold, yet the ROC curve misleadingly shows a near-perfect classifier because specificity is above 99%. This brings us to the next question.
- Since in multi-class settings it is more reasonable to measure sensitivity (aka recall) and precision (instead of specificity), what if we plot a similar curve between recall and 1-precision?
We will try to understand what happens when we plot a curve between recall (aka sensitivity) and 1-precision by varying the threshold, using an analysis similar to the one in the first question. We already know that on increasing the probability threshold, sensitivity either decreases or stays the same. What happens to 1-precision when the probability threshold is increased? 1-precision = FP/(TP+FP), which can be rewritten as 1/(1+(TP/FP)). Thus the ratio TP/FP governs the behaviour of 1-precision (note that, unlike with sensitivity and 1-specificity, we don’t have the luxury of a constant denominator). And we have already seen in the answer to question 1 that on increasing the probability threshold, TP decreases, FP decreases, or both. This means that depending on whether TP decreases, FP decreases, or both, and by how much, TP/FP and thus 1-precision may increase or decrease. An example of such a case is the following:
Observe the bottom-left corner to notice such behaviour. This has consequences: AUC is not well defined for such a curve, because the potential upward zig-zags mean there is no single well-defined area under it. And since we can’t compute an AUC, we can’t summarize classifier performance as simply as we would have with the area. But does that mean such a curve (which can be called a PR curve) is totally useless? No, because we can still find a probability threshold that gives the sweet spot between precision and recall, which again happens to be the point closest to the topmost-leftmost corner of the curve.
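A tiny made-up example is enough to see 1-precision move the "wrong" way: a single confidently scored negative means that raising the threshold sheds true positives while the false positive stays, so TP/FP falls and 1-precision rises even as recall falls. The helper below is hypothetical, mirroring the ROC analysis earlier:

```python
import numpy as np

def pr_points(y_true, y_score, thresholds):
    """(1-precision, recall) pair for each threshold."""
    points = []
    for t in thresholds:
        y_pred = (y_score >= t).astype(int)
        tp = np.sum((y_pred == 1) & (y_true == 1))
        fp = np.sum((y_pred == 1) & (y_true == 0))
        fn = np.sum((y_pred == 0) & (y_true == 1))
        recall = tp / (tp + fn)
        one_minus_precision = fp / (tp + fp) if (tp + fp) else 0.0
        points.append((one_minus_precision, recall))
    return points

# Two positives with moderate scores, one negative with a high score:
y_true = np.array([1, 1, 0])
y_score = np.array([0.4, 0.6, 0.8])
pts = pr_points(y_true, y_score, [0.3, 0.5, 0.7])
print(pts)  # 1-precision rises 1/3 -> 1/2 -> 1 as recall falls 1 -> 0.5 -> 0
```

The sweet-spot threshold can still be picked the same way as for the ROC curve: take the point closest to the top-left corner (1-precision = 0, recall = 1).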