The Receiver Operating Characteristic, or ROC, curve is a tried-and-true diagnostic for binary classification models. Its simple, elegant design allows analysts to easily choose among multiple classifiers as well as among any number of probability cutoffs by plotting the True Positive rate against the False Positive rate. In this post, we’ll dive into the key concepts, uses, and interpretations of ROC curves in predictive modeling, as well as the diagnostic’s incredible origin story.
In the 1940s, at the height of World War II, radar technology was in its infancy. At that time, radar devices (the name was then still an acronym for RAdio Detection And Ranging) projected radio waves from a transmitting antenna, and the reflections were analyzed and processed by a receiver. Objects of sufficient size within range of the radar, such as enemy airplanes or ships at sea, could be detected when the radio waves reflected off them and back to the receiving antenna. However, then as now, a human was required to interpret the raw data and make decisions about what actions to take.
To make the life-or-death decisions associated with the possibility of incoming bombers, WWII radar operators could manually adjust the amount of “gain” on their receivers. With gain set to zero, no signal is received. Increasing gain allows more signal (enemy planes or ships) to be detected, but also increases the amount of noise (for example, a flock of birds or severe weather) that gets picked up and possibly misinterpreted as true signal. At low levels of gain, noise is very weak and therefore unlikely to be mistaken for true signal; however, only the strongest signals (very close or very large aircraft) can be picked up, increasing the chances that some true signal will be missed. As gain is turned up, weaker signals can be detected, but more noise is also picked up. Increasing gain further eventually becomes counterproductive as signal becomes indistinguishable from noise. Finding the optimal balance of signal to noise, so that the greatest number of true threats was detected while false positives were minimized, was therefore a critical skill. Throughout the war, radar operators were evaluated on their ability to find this balance, leading to further developments in signal detection theory and giving rise to the term still used today: Receiver Operating Characteristics.
After the war, it became apparent that similar methods could be used in fields such as experimental psychology and medicine, notably radiology. Let’s imagine a radiologist is looking at a set of 10 images from 10 different patients and needs to determine which patients have cancer. In our example, we’ll imagine that 3 of these patients actually have cancer. The radiologist could diagnose all 10 with cancer and not miss anyone, but she would be misdiagnosing 7 of her patients, possibly leading to expensive and unnecessary treatment for a disease those 7 people don’t have (this kind of mistake is known as a False Positive, or Type I error in statistics terminology). On the other hand, she could say none of the patients have cancer and claim a 70% accuracy rate, although her misdiagnosis of the 3 patients who actually do have cancer is arguably a more serious error (this kind of mistake is commonly known as a False Negative, or Type II error).
This concept of balancing the ratio of correct and incorrect decisions is at the heart of the ROC curve. To explain a bit further, we can illustrate the possible outcomes of a binary (yes/no, true/false, cancer/no cancer, etc.) classifier with something called a Confusion Matrix:
| |Actual: Yes|Actual: No|
|Predicted: Yes|True Positives (TP)|False Positives (FP)|
|Predicted: No|False Negatives (FN)|True Negatives (TN)|
From this 2×2 matrix, we can calculate several statistics to evaluate a binary classifier and inform our understanding of ROC curves:
- Sensitivity (aka Recall, Hit Rate, True Positive Rate or “TPR”):
TP / (TP + FN)
- Specificity (aka True Negative Rate or “TNR”):
TN / (FP + TN)
- False Positive Rate (aka “FPR”):
FP / (FP + TN) = 1 - Specificity
- Accuracy:
(TP + TN) / (TP + FN + FP + TN)
- Note: Several other measurements can be calculated from the table above which are not central to our discussion of ROC curves, but which are nonetheless important to be familiar with, including: Precision (aka Positive Predictive Value), Negative Predictive Value, Prevalence, False Discovery Rate, and False Omission Rate, among others.
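As a minimal sketch, the four formulas above can be written as one small Python function. The function name and the counts in the example are hypothetical, chosen only for illustration, and the sketch assumes at least one actual positive and one actual negative so that no denominator is zero.

```python
def classification_rates(tp, fp, fn, tn):
    """Compute the rates defined above from confusion-matrix counts.

    Assumes tp + fn > 0 and fp + tn > 0 (both classes are present).
    """
    return {
        "sensitivity": tp / (tp + fn),            # True Positive Rate
        "specificity": tn / (fp + tn),            # True Negative Rate
        "false_positive_rate": fp / (fp + tn),    # equals 1 - specificity
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }


# Hypothetical counts, not taken from any example in this post.
rates = classification_rates(tp=40, fp=10, fn=20, tn=130)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

Note that the False Positive Rate and Specificity always sum to 1, which is why ROC plots can use either one on the x-axis.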
If we continue to use our radiology example from above where 3 of the 10 patients actually have cancer, in the scenario where our radiologist diagnoses all 10 patients with cancer, the confusion matrix holds TP = 3, FP = 7, FN = 0, and TN = 0, which gives a Sensitivity of 3/3 = 100%, a Specificity of 0/7 = 0%, and an Accuracy of 3/10 = 30%.
In the example above, our radiologist has decided it is more important to not risk missing any true cases of cancer by favoring Sensitivity at the expense of Specificity, and ultimately, Accuracy.
On the other hand, if our radiologist concludes that 0 of the 10 patients have cancer, she is in effect favoring Specificity at the expense of Sensitivity, while still not achieving terribly impressive Accuracy. In this case, the confusion matrix holds TP = 0, FP = 0, FN = 3, and TN = 7, which gives a Sensitivity of 0/3 = 0%, a Specificity of 7/7 = 100%, and an Accuracy of 7/10 = 70%.
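The two extreme scenarios can be checked by tallying a confusion matrix from lists of actual and predicted labels. This is only a sketch: the helper names are made up for illustration, and the label lists simply encode the 10-patient example (3 with cancer, 7 without).

```python
def confusion_counts(actual, predicted):
    """Tally TP, FP, FN, TN from paired boolean labels (True = cancer)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)
    fp = sum(1 for a, p in pairs if not a and p)
    fn = sum(1 for a, p in pairs if a and not p)
    tn = sum(1 for a, p in pairs if not a and not p)
    return tp, fp, fn, tn


def summarize(actual, predicted):
    """Return sensitivity, specificity, and accuracy for one set of diagnoses."""
    tp, fp, fn, tn = confusion_counts(actual, predicted)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "accuracy": (tp + tn) / len(actual),
    }


# 10 patients: the first 3 actually have cancer, the other 7 do not.
actual = [True] * 3 + [False] * 7

diagnose_everyone = summarize(actual, [True] * 10)   # all 10 flagged
diagnose_no_one = summarize(actual, [False] * 10)    # none flagged

print("Diagnose everyone:", diagnose_everyone)
print("Diagnose no one:  ", diagnose_no_one)
```

Diagnosing everyone yields perfect Sensitivity with zero Specificity, and diagnosing no one yields the reverse, which is exactly the trade-off the ROC curve is designed to show.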
The basic premise of the ROC curve, then, is to plot the True Positive Rate (“TPR”, Sensitivity) against the False Positive Rate (“FPR”, 1-Specificity) as a way to visualize the trade-off between the different types of errors. However, the real strength of the ROC curve lies in the fact that it plots these values for a variety of probability thresholds to help analysts visualize what might happen if on the one hand they favor not missing any signal while letting in lots of noise, versus risking missing some signal in favor of minimizing noise (think back to our WWII radar example of the operator adjusting the gain on his receiver).
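To make the idea of sweeping the threshold concrete, here is a rough sketch that computes one (FPR, TPR) point per threshold. The labels and predicted probabilities are invented for illustration, and the rule "predict positive when the score is at or above the threshold" is an assumption about how the cutoff is applied.

```python
def roc_points(labels, scores, thresholds):
    """Return a list of (threshold, fpr, tpr) tuples.

    labels: 1 for an actual positive, 0 for an actual negative.
    scores: predicted probabilities of the positive class.
    A case is classified positive when its score >= threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for t in thresholds:
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= t)
        points.append((t, fp / negatives, tp / positives))
    return points


# Hypothetical data: 4 positives and 6 negatives with made-up scores.
labels = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1, 0.05]

points = roc_points(labels, scores, thresholds=[0.0, 0.25, 0.5, 0.75, 1.0])
for threshold, fpr, tpr in points:
    print(f"threshold={threshold:.2f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

A threshold of 0 flags everything (high gain: all signal and all noise), while a threshold of 1 flags nothing; the points in between trace out the curve.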
Let’s look at a couple of example ROC curves displaying different thresholds to illustrate how this affects Sensitivity and Specificity:
The example in Figure 1 below was drawn using the statistical programming language R, and displays Sensitivity on the y-axis, Specificity on the x-axis, and plots a single point with a label representing the binary decision threshold (in this case 0.5) along with the Sensitivity and Specificity values (51.6% and 99.9%, respectively). Because FPR = 1 - Specificity, a Specificity axis carries the same information as an FPR axis. The blue line represents the relationship between TPR and FPR at various probability thresholds. The grey diagonal line represents random guessing and is sometimes called the “line of no-discrimination.” If we were to flip a coin to classify a binary event instead of building a predictive model, we would expect our results to lie on this line. For our purposes, we’re looking for a ROC curve that’s well above the grey diagonal line and as close to the upper left-hand corner of the plot as possible, an indication that our classifier is doing a better job than random guessing.
In addition to these components of a ROC curve, a metric known as the Area Under the Curve (AUC) is often calculated alongside the plot itself. As the name implies, the AUC measures the space under the curve and quantifies the extent to which the curve follows the upper left-hand section of the plot. Possible values range from 0 to 1: an AUC of 0.5 corresponds to a curve that follows the grey “no-discrimination” line, and an AUC of 1 corresponds to a curve that travels up the y-axis and then turns 90 degrees across the top of the plot, indicating 100% Sensitivity and 100% Specificity.
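One common way to approximate the AUC from a set of (FPR, TPR) points is the trapezoidal rule. The sketch below assumes that approach (the post does not say how its AUC was computed), and the example curve is a hypothetical set of points rather than the curve in Figure 1.

```python
def auc_trapezoid(points):
    """Approximate the area under a ROC curve with the trapezoidal rule.

    points: iterable of (fpr, tpr) pairs that includes (0, 0) and (1, 1).
    """
    pts = sorted(points)  # order by FPR, then TPR
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2  # width times average height
    return area


# The "line of no-discrimination" and a perfect classifier.
chance_auc = auc_trapezoid([(0.0, 0.0), (1.0, 1.0)])
perfect_auc = auc_trapezoid([(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)])

# A hypothetical curve that bows toward the upper left-hand corner.
example_auc = auc_trapezoid(
    [(0.0, 0.0), (0.0, 0.5), (1 / 3, 0.75), (0.5, 1.0), (1.0, 1.0)]
)

print(f"chance:  {chance_auc:.3f}")
print(f"perfect: {perfect_auc:.3f}")
print(f"example: {example_auc:.3f}")
```

The diagonal comes out at 0.5 and the perfect classifier at 1, matching the two reference values described above.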
In this example, our threshold is set at 0.5, or in other words, a balance between signal and noise. We might choose this threshold in cases where the probability of a given event was known to be relatively high within the population, and where we could afford to compromise between the TPR and FPR. At this threshold, our classifier has high Specificity (is good at minimizing noise) but relatively poor Sensitivity (has a hard time determining true signal).
Unfortunately, the events of interest in predictive modeling are typically very rare (a fact that makes them so well suited to modeling in the first place!), and analysts can seldom afford such compromises between correct and incorrect classification. We need to be cognizant of the difference between True Positives and False Positives in order to make an informed decision about our tolerance for both. Below, Figure 2 plots the exact same curve as in Figure 1 above (including the same underlying data and model), but we have lowered the probability threshold considerably to reflect a very rare event, perhaps a rare disease. Note the change to Specificity and Sensitivity between Figures 1 and 2:
At a probability threshold of 0.003, this classifier achieves very high Sensitivity and Specificity, meaning it’s very good at detecting signal and minimizing noise.
It’s very important to note at this point that adjusting the probability threshold to optimize Sensitivity and Specificity may not improve overall accuracy; it may simply “shift” the incorrect classifications between False Positives (aka Type I errors) and False Negatives (aka Type II errors). As an example, see the two confusion matrices below:
In both examples above, the total population of 100,187 people was classified with 99.9% accuracy (thanks in large part to the True Negatives, since this is a very rare event); however, Sensitivity varies significantly. In Table 4, Sensitivity suffers due to the large number of False Negatives, whereas in Table 5 Sensitivity is much higher even though the model makes the same number of misclassifications in each example (105).
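The same effect can be reproduced with a quick calculation. Only the population size (100,187) and the total number of misclassifications (105) come from the text; the number of actual positives (200) and the way the 105 errors are split between False Negatives and False Positives are hypothetical values chosen for illustration, not the figures from Tables 4 and 5.

```python
TOTAL = 100_187          # population size from the text
ACTUAL_POSITIVES = 200   # hypothetical: assumed only for this sketch


def rates_for(fn, fp):
    """Accuracy and sensitivity for a given split of misclassifications."""
    tp = ACTUAL_POSITIVES - fn
    tn = TOTAL - ACTUAL_POSITIVES - fp
    return {
        "accuracy": (tp + tn) / TOTAL,
        "sensitivity": tp / (tp + fn),
    }


# Both hypothetical scenarios contain 105 errors in total.
mostly_false_negatives = rates_for(fn=100, fp=5)
mostly_false_positives = rates_for(fn=5, fp=100)

print("Mostly False Negatives:", mostly_false_negatives)
print("Mostly False Positives:", mostly_false_positives)
```

Accuracy is identical in both cases because it depends only on the total number of errors, while Sensitivity changes sharply depending on how many of those errors are False Negatives.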
So where does that leave us? How should we evaluate our predictive models knowing a choice needs to be made between two different kinds of misclassifications? The answer relates directly to the domain for which we’re building the predictive model in the first place. If our intention is to evaluate a classifier that will be used to diagnose cancer or in airport security screening, it makes sense to accept the cost of more False Positives (Type I errors) in return for the benefit of detecting as many tumors or security risks as possible. Likewise, in nonprofit fundraising, it can be argued that we should favor False Positives over False Negatives in most instances. If we build a model to predict donor acquisition, the cost (both literal and figurative) of predicting that someone will become a donor who ends up not doing so is much lower than the cost of overlooking someone who would otherwise donate simply based on a misclassification. There are important caveats to this, and the statement above may not hold true in all cases, but hopefully this is an illustrative example of the importance of thinking through all the possible outcomes of a binary classifier, knowing how to use the ROC curve and AUC statistic to evaluate different models, and understanding the trade-off between Type I and Type II errors. At the very least, perhaps you learned some WWII trivia. Happy modeling!