The sensitivity and specificity are lying to you

When we talk about diagnostic tests, we are obsessed with sensitivities and specificities. In many papers, they are the only numbers reported. When we discuss diagnostic tests at conferences, sensitivity and specificity are frequently the only numbers mentioned. Even on First10EM, I have frequently given sensitivity and specificity the leading role when discussing diagnostic tests. Based on the “spIN” and “snOUT” mnemonics, sensitivity and specificity seem straightforward. We have been taught that sensitivity will help us rule disease out and specificity will help us rule disease in. It turns out, that is a complete lie. Most of us don’t really understand what sensitivity and specificity mean, and it has been hurting our patients. 

Imagine a patient with a possible subarachnoid hemorrhage (SAH). Based on their presentation, you figure they have about a 10% chance of ultimately being diagnosed with SAH. Imagine a new decision rule that has a 90% sensitivity for subarachnoid hemorrhage. Given that we want to rule this disease out, that sounds promising. The rule only has a 10% specificity, but you figure you can work with a few false positives if the 90% sensitivity helps you rule out SAH. So, how much does the 90% sensitivity decrease your patient’s chance of having a subarachnoid hemorrhage if they pass the rule?

Let’s do the math, but to keep it simple, I will perform the calculations using pictures. Imagine 100 patients coming to the emergency department with headaches. Based on your assessment, you think 10 (or 10%) of these patients will rule in for subarachnoid hemorrhage. 

Your decision rule is 90% sensitive, so it will identify 9 out of the 10 patients with disease. That sounds promising. It sounds like this rule could be helpful. If we focus on just the sensitivity, it looks like we will only miss 1 patient in 100, which might be good enough.

This is usually where we stop in medicine. When attempting to rule a disease out, we only look at the sensitivity. That is certainly how I was taught. However, let’s consider the impact of the 10% specificity. There are 90 healthy patients in this cohort, and the 10% specificity means that 81 of them will fail the decision rule, or be false positives. Now, we can start to see the problem.

There are 90 patients with ‘positive’ tests and 9 of them have a subarachnoid hemorrhage. In other words, if you fail the decision rule, you have a 10% chance of having a subarachnoid hemorrhage. There are 10 patients with ‘negative’ tests, and 1 has a subarachnoid hemorrhage. In other words, if you pass the decision rule, you have a 10% chance of having subarachnoid hemorrhage.

This was a test with a 90% sensitivity. It was supposed to help us rule out disease. Instead, we have the exact same chance of disease before and after the test, no matter what the result!
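If you want to check the arithmetic yourself, here is a minimal sketch in Python, using the same numbers as the example above (the variable names are just for illustration):

```python
# Worked example: 100 patients, 10% pretest probability of SAH,
# decision rule with 90% sensitivity and 10% specificity.
n = 100
prevalence = 0.10
sensitivity = 0.90
specificity = 0.10

diseased = n * prevalence            # 10 patients with SAH
healthy = n - diseased               # 90 patients without SAH

true_pos = diseased * sensitivity    # 9 correctly flagged by the rule
false_neg = diseased - true_pos      # 1 missed
true_neg = healthy * specificity     # 9 correctly pass the rule
false_pos = healthy - true_neg       # 81 false positives

prob_if_positive = true_pos / (true_pos + false_pos)   # 9 / 90
prob_if_negative = false_neg / (false_neg + true_neg)  # 1 / 10

print(f"Chance of SAH if the rule is positive: {prob_if_positive:.0%}")
print(f"Chance of SAH if the rule is negative: {prob_if_negative:.0%}")
# Both print 10%: the result of the rule doesn't change the probability at all.
```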

When this was first explained to me, my mind was absolutely blown. Everything I had been taught about sensitivity and specificity was a lie. Sensitivity is supposed to help rule disease out (snOUT). How is it possible that a test with a 90% sensitivity (significantly better than many of the tests we use every day in emergency medicine) didn’t change the patient’s chance of disease at all?!

It turns out, you can’t consider just the sensitivity or just the specificity in isolation. Although that is exactly how we talk about these measures, they are absolutely useless on their own. In order to figure out whether a test is helpful, you have to consider both sensitivity and specificity together, or – as I will suggest – use more useful numbers, like likelihood ratios, and just stop talking about sensitivity and specificity altogether.

We make this mistake all the time in medicine. We adopt tests based on just the sensitivity or just the specificity. We use these tests, but clearly we don’t understand how they really work. Consider the Ottawa subarachnoid hemorrhage rule. Based on its excellent sensitivity, many are pushing for its widespread use. However, the actual numbers for the Ottawa subarachnoid hemorrhage rule are a sensitivity of approximately 100% (with 95% confidence intervals extending down to 95-97%) and a specificity between 7.5 and 15%. (Perry 2017; Bellolio 2015; Chu 2018; Perry 2020) I just demonstrated that a test with a 90% sensitivity and 10% specificity is completely useless; it does not change a patient’s chance of subarachnoid hemorrhage at all. Does this rule sound much better?

This shouldn’t have come as a surprise. By definition, sensitivity and specificity are clinically useless. Sensitivity is defined as the percentage of patients with a disease who are accurately identified by a positive test. It’s a measure of the accuracy of a test in a group of patients known to have the disease. Clinically, we don’t know if a patient has a disease. That is exactly why we are ordering a test. So the very definition of sensitivity tells us that it is not a measure we should be applying in a clinical setting. 
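Written as conditional probabilities (using the standard 2x2 table terms: true positives TP, false negatives FN, true negatives TN, false positives FP), the definitions make this explicit – both numbers assume you already know the patient’s disease status:

\[
\text{sensitivity} = P(\text{test}+ \mid \text{disease}+) = \frac{TP}{TP + FN}
\qquad
\text{specificity} = P(\text{test}- \mid \text{disease}-) = \frac{TN}{TN + FP}
\]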

Predictive values can also be misleading

When we order tests, what we really want to know is, if the test is positive, what are the chances that this patient actually has the disease? Or, conversely, if the test is negative, what are the chances that the patient doesn’t have the disease? The positive and negative predictive values, respectively, tell us exactly that. If the positive predictive value is 95%, and the patient tests positive, there is a 95% chance that the patient has the disease.
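For reference, the predictive values are defined the other way around, conditioning on the test result rather than the disease status:

\[
\text{PPV} = P(\text{disease}+ \mid \text{test}+) = \frac{TP}{TP + FP}
\qquad
\text{NPV} = P(\text{disease}- \mid \text{test}-) = \frac{TN}{TN + FN}
\]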

This sounds like the perfect measure. It appears to tell us exactly what we need to know as clinicians. Unfortunately, the predictive values have a fatal flaw: they are inherently tied to the prevalence of the disease in the patients you are testing. You can’t generalize the number from one group to another. Just because a study states that a test has a negative predictive value of 99% doesn’t mean that it will be 99% for your patient, and that is obviously a problem. 

This is best understood with a simple example. Imagine using a coin flip to decide whether a patient has a pulmonary embolism (PE). In an emergency department setting, where 10% of patients being tested have a PE, when the coin comes up heads, or “positive”, 10% of those patients will have a PE, so the positive predictive value of my coin flip is 10%. When the coin comes up tails, or “negative”, 10% of those patients also have a PE, so the negative predictive value of the coin flip is 90%. In this setting, it is pretty obvious that the coin flip is not very good at diagnosing PE.

However, imagine that I decide to test the exact same coin flip in a PE follow up clinic, where 100% of patients are known to have PE. Now, when my coin flip comes up heads, 100% of patients have PE. My coin flip has a 100% positive predictive value for pulmonary embolism! I could probably get this published in a major medical journal (if the test was more expensive and someone was going to profit).

Conversely, if I decide to test my coin in asymptomatic individuals visiting their doctor for a yearly physical, I can generate the opposite results. Now, when the coin comes up tails, 0% of patients have PE, so my coin flip has a negative predictive value of 100%! It’s a perfect test – except obviously it’s not.
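Here is a quick sketch of the coin flip example in Python, assuming a fair coin (50% sensitivity and 50% specificity) applied at the three prevalences described above (the function name is just illustrative):

```python
# A coin flip "test": heads ("positive") half the time regardless of disease,
# so sensitivity and specificity are both 50%. The predictive values,
# however, depend entirely on the prevalence in the group being tested.

def predictive_values(prevalence, sensitivity=0.5, specificity=0.5):
    """Return (positive, negative) predictive values at a given prevalence."""
    tp = prevalence * sensitivity               # diseased, test positive
    fn = prevalence * (1 - sensitivity)         # diseased, test negative
    tn = (1 - prevalence) * specificity         # healthy, test negative
    fp = (1 - prevalence) * (1 - specificity)   # healthy, test positive
    return tp / (tp + fp), tn / (tn + fn)

for setting, prev in [("ED, 10% of tested patients have PE", 0.10),
                      ("PE follow-up clinic, 100% have PE", 1.00),
                      ("Asymptomatic screening, 0% have PE", 0.00)]:
    ppv, npv = predictive_values(prev)
    print(f"{setting}: PPV = {ppv:.0%}, NPV = {npv:.0%}")

# ED: PPV = 10%, NPV = 90%
# Follow-up clinic: PPV = 100% (and NPV = 0%)
# Screening: NPV = 100% (and PPV = 0%)
```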

So predictive values can also be very misleading. These examples sound extreme, but they are well represented in the medical literature. We have tested coronary CT angiograms in populations where 0% of patients have bad outcomes, and then gleefully proclaimed that CCTA has an amazing negative predictive value. Hopefully it is now obvious why such statements are ridiculous.

Although predictive values are closer to what we want to know when working clinically, they can clearly still be very misleading. Like the sensitivity and specificity, I think we would be better off if we just stopped talking about these numbers. 

Likelihood ratios: the diagnostic number that really matters

We need a measure that incorporates the risk of the patient in front of us and tells us how much that risk changes when the test is positive or negative. The solution is likelihood ratios. 

A likelihood ratio (as is implied by the name) is a ratio of two different probabilities: the probability of a patient with a condition having a given test result divided by the probability of a patient without a condition having the given test result. (The only difference between the positive and negative likelihood ratio in this formula is whether you are talking about the test result being positive or negative.)
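Written out using the standard algebra, the two likelihood ratios are:

\[
LR+ = \frac{P(\text{test}+ \mid \text{disease}+)}{P(\text{test}+ \mid \text{disease}-)} = \frac{\text{sensitivity}}{1 - \text{specificity}}
\qquad
LR- = \frac{P(\text{test}- \mid \text{disease}+)}{P(\text{test}- \mid \text{disease}-)} = \frac{1 - \text{sensitivity}}{\text{specificity}}
\]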

At face value, this sounds a little complicated, but the result is exactly what we need clinically. When working clinically, we want to know what a test result means for the specific patient in front of us. The likelihood ratio will give you a number that adjusts your pre-test probability into exactly what you want: the chance that this specific patient has the disease given the test result you just got back.

The overall concept is very easy. You take your pretest probability and multiply it by the likelihood ratio and you get the posttest probability. Unfortunately, the math gets a little complex, because it uses odds rather than probabilities, but the basic concept is simple. If you multiply by 1, your odds don’t change at all, so a test with a likelihood ratio of 1 is completely useless. If you multiply by a big number (say bigger than 10) then your chances of disease go up by a lot. If you multiply by a small number (say smaller than 0.1) then your chances of disease go down by a lot. 

If you want to get more specific than that, you can use the Fagan nomogram. It is incredibly easy. You just start with your pretest probability on the left, draw a line through your likelihood ratio, and it tells you your posttest probability on the right. Better yet, these days you can just use one of the many online calculators.
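For those who prefer code to nomograms, here is a minimal sketch of the same odds arithmetic in Python (the function and variable names are just illustrative), using the 90% sensitive, 10% specific rule from the opening example; both of its likelihood ratios work out to 1, which is why the posttest probability never budges:

```python
# Posttest probability via likelihood ratios:
# probability -> odds, multiply by the LR, then odds -> probability.

def post_test_probability(pretest_prob, likelihood_ratio):
    pretest_odds = pretest_prob / (1 - pretest_prob)
    posttest_odds = pretest_odds * likelihood_ratio
    return posttest_odds / (1 + posttest_odds)

sensitivity, specificity = 0.90, 0.10
lr_pos = sensitivity / (1 - specificity)        # 0.9 / 0.9 = 1.0
lr_neg = (1 - sensitivity) / specificity        # 0.1 / 0.1 = 1.0

pretest = 0.10  # 10% pretest probability of SAH
print(post_test_probability(pretest, lr_pos))   # 0.10 - unchanged
print(post_test_probability(pretest, lr_neg))   # 0.10 - unchanged

# Compare with a genuinely useful test, for example an LR- of 0.1:
print(post_test_probability(pretest, 0.1))      # ~0.011, about 1%
```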

Bottom line

Sensitivity and specificity have been lying to us. The spIN / snOUT mnemonic that we all learned is incorrect. Sensitivity cannot be considered without specificity, and specificity cannot be considered without sensitivity. These numbers are counter-intuitive and don’t provide us with the information that we need clinically. We should stop using them.

When using a diagnostic test, you must first know your patient’s pretest probability. Once you know the pretest probability, it is the likelihood ratio that will give you the information you need.

References

Bellolio MF, Hess EP, Gilani WI, et al. External validation of the Ottawa subarachnoid hemorrhage clinical decision rule in patients with acute headache. Am J Emerg Med. 2015;33(2):244-249. doi:10.1016/j.ajem.2014.11.049 PMID: 25511365

Chu KH, Keijzers G, Furyk JS, et al. Applying the Ottawa subarachnoid haemorrhage rule on a cohort of emergency department patients with headache. Eur J Emerg Med. 2018;25(6):e29-e32. doi:10.1097/MEJ.0000000000000523 PMID: 29215380

Perry JJ, Sivilotti MLA, Sutherland J, et al. Validation of the Ottawa Subarachnoid Hemorrhage Rule in patients with acute headache [published correction appears in CMAJ. 2018 Feb 12;190(6):E173]. CMAJ. 2017;189(45):E1379-E1385. doi:10.1503/cmaj.170072 PMID: 29133539

Perry JJ, Sivilotti MLA, Émond M, et al. Prospective Implementation of the Ottawa Subarachnoid Hemorrhage Rule and 6-Hour Computed Tomography Rule. Stroke. 2020;51(2):424-430. doi:10.1161/STROKEAHA.119.026969 PMID: 31805846

Worster A, Carpenter C. A brief note about likelihood ratios. CJEM. 2008;10(5):441-442. PMID: 18826732

Cite this article as:
Morgenstern, J. The sensitivity and specificity are lying to you, First10EM, February 8, 2021. Available at:
https://doi.org/10.51684/FIRS.73339
