In the spirit of Carnac the Magnificent, the title of this post could be subtitled: Name three things you don't want.

The other day, Google News showed me this article about people being misdiagnosed with Lyme disease. From the article:
The Public Health Agency of Canada says a review of its Lyme disease testing methods has turned up 24 patients in five provinces who received false-negative test results.
The testing was done at the agency's National Microbiology Laboratory in Winnipeg. The review, a quality control analysis, looked at more than 1,500 samples tested for Lyme disease and discovered the 24 incorrect test results last month.
A similarly themed article appeared in the New York Times about the guidelines for detecting prostate cancer and the apparently many needless biopsies that result. These articles made me recall a story from my former graduate statistics professor who told us that upon applying for Canadian citizenship, he and his wife were repeatedly flagged for having syphilis. He assured us that, contrary to what the tests indicated, neither of them had the disease. That is, their tests repeatedly came up as false positives.
I thought this would be a useful opportunity to discuss a topic with widespread implications, many of which are not immediately obvious to those just learning statistics: Signal Detection Theory, or SDT for short. Most psychology students hear about SDT when learning about Type I and Type II errors, which I will discuss later in this post, although the connection between the two topics is not always made explicit; see here for a welcome exception.
As its name suggests, signal detection theory is concerned with determining whether a desired stimulus - the signal - is present in a given scenario. Some examples will hopefully make this idea clearer. When reading these examples, consider them in the context of determining whether or not the detection mechanisms or tests are operating correctly.
Imagine a soldier examining a radar screen looking for signs of enemy aircraft. Delivering the right information in this job is of vital importance in determining whether or not the country goes to war. For every sweep of the radar, the soldier is in one quadrant of this 2 x 2 matrix:
| Type of aircraft | What the soldier sees: Blip | What the soldier sees: No blip |
| --- | --- | --- |
| Enemy aircraft | To war! | Big problem |
| Friendly aircraft | Egg on soldier's face | Keep calm and carry on |
Imagine the soldier sees a blip:
- If the blip corresponds to an enemy aircraft, the soldier warns the general and, in turn, the country goes to war.
- If the blip on the radar corresponds to a friendly aircraft, which suggests the radar is malfunctioning, the soldier looks foolish and has wasted the general's time.
In one scenario, there is a huge cost of lives and resources, but the soldier did what was required and expected of him/her in this situation. In the other scenario, the soldier looks foolish and, at worst, appears incompetent in the general's eyes. Which of these is worse?
Let's take this discussion one step further: Imagine the soldier does not see a blip, which means that s/he will not speak to the general.
- If no blip was spotted, but there is, in fact, an enemy aircraft, the radar may be malfunctioning and the country becomes the victim of an ambush.
- If no blip was spotted and a friendly aircraft passed through the radar space, then the radar is performing properly, the country stays at peace and the soldier waits for the next sweep of the radar.
Here, the question of "which is worse?" is easier to answer, as the no blip / friendly aircraft scenario is more desirable. In truth, however, all four quadrants need to be considered when answering the question of "which is worse?" I will come back to this idea soon.
I chose the radar example for a number of reasons, not least of which is historical deference to the formative work done in signal detection theory. It is easy to see the importance of making the correct decision in the context of a soldier scanning a radar screen. This same logic applies to the test for Lyme disease that prompted this post.
Imagine you are a schoolteacher, one Miss Elizabeth Hoover, awaiting the results of your Lyme disease test. In this example, the above grid now looks like this:
| Does Miss Hoover actually have Lyme disease? | Test result: Positive (indicates Lyme disease) | Test result: Negative (indicates absence of Lyme disease) |
| --- | --- | --- |
| Yes | Take necessary sick leave: stay home, call substitute, seek treatment | I'm sick but don't know why... |
| No | Take unnecessary sick leave: stay home, call substitute, waste time and resources | Back to work... perhaps symptoms were all in her head |
Again, the outcome is of major importance. Like the radar example above, the implications have life or death consequences:
- If the test indicates the presence of Lyme disease and Miss Hoover actually has Lyme disease, she will be sick for a time but will receive treatment and live.
- If the test indicates Miss Hoover has Lyme disease but she is, in truth, disease-free, she will have taken unnecessary sick leave from work and wasted the time and resources of the health-care system.
- If the test is negative but Miss Hoover really does have Lyme disease, she will get increasingly sick and eventually die. The doctor who administered and misinterpreted the results could also be sued for malpractice.
- Finally, if the test is negative and Miss Hoover is free of Lyme disease, which is clearly the best case scenario, then she goes back to work.
As with the radar example, the negative test-no Lyme disease quadrant is best. However, also as with the radar example, we need to consider all four quadrants together. But why? The answer lies in one key element of signal detection theory: Tradeoffs.
The question we must answer in SDT is this: Are we more willing to accept (a) detecting a fake, non-existent signal or (b) failing to detect a real, existing signal? That is, are the consequences worse for (a) claiming to find something that is not actually there or (b) ignoring something that is actually there? In the language of signal detection, we are asking whether we are more tolerant of (a) false positives (or Type I errors) or (b) false negatives (or Type II errors):
| Reality / Truth | Decision: Signal present | Decision: Signal absent |
| --- | --- | --- |
| Yes | Hit | False negative / Miss / Omission / Type II error |
| No | False positive / False alarm / Commission / Type I error | Correct rejection |
Think about the above examples in the context of this matrix:
- If our test indicates a signal is present and the signal is actually present, we count that as a hit. That is, the strength of the signal was greater than that of the noise.
- If our test indicates a signal is present but there is no actual signal, we count that as a false positive, also called a false alarm or commission error. In statistical hypothesis testing, these errors are Type I errors: We claim to have found a significant difference when, in fact, this difference occurred by chance.
- If our test fails to detect an actual signal, we count that as a false negative, also called a miss or omission error. In statistical hypothesis testing, these are Type II errors: We report that no significant difference was found when, in fact, there is a significant difference.
- If our test correctly indicates that no signal is present, we count that as a correct rejection.
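The four outcomes above can be sketched in a few lines of Python. This is only an illustration: the function names `classify` and `outcome_rates` and the list of trials are made up for the example and do not come from any real test.

```python
from collections import Counter


def classify(signal_present: bool, says_present: bool) -> str:
    """Name the quadrant for one trial: reality first, decision second."""
    if signal_present and says_present:
        return "hit"
    if signal_present and not says_present:
        return "miss"  # false negative, Type II error
    if not signal_present and says_present:
        return "false alarm"  # false positive, Type I error
    return "correct rejection"


def outcome_rates(trials):
    """Turn (reality, decision) pairs into rates within each row of the matrix."""
    counts = Counter(classify(truth, decision) for truth, decision in trials)
    signal_trials = counts["hit"] + counts["miss"]
    noise_trials = counts["false alarm"] + counts["correct rejection"]
    return {
        "hit rate": counts["hit"] / signal_trials if signal_trials else None,
        "miss rate": counts["miss"] / signal_trials if signal_trials else None,
        "false alarm rate": counts["false alarm"] / noise_trials if noise_trials else None,
        "correct rejection rate": counts["correct rejection"] / noise_trials if noise_trials else None,
    }


# Hypothetical trials: (signal actually present?, test says present?)
trials = [
    (True, True), (True, True), (True, False), (True, True),
    (False, True), (False, False), (False, False), (False, False),
]

for name, rate in outcome_rates(trials).items():
    print(f"{name}: {rate}")
```

Note that the hit rate and miss rate share one denominator (trials where the signal was really there), while the false alarm rate and correct rejection rate share the other (trials where it was not).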
In any given scenario, a decision must be made whether a signal is present or absent. That is, a particular threshold or criterion must be used to make this decision. In the case of the radar, the proximity, shape and flight pattern of an aircraft factor into whether the radar registers a blip or not. In the case of the Lyme disease test, the level of antibodies in the blood functions in a similar way.
As mentioned in a previous post, and as is well established in psychometrics, especially classical test theory, there is no test that is perfectly reliable. In turn, there is no test that is perfectly valid. Taken together, these points imply that we cannot know someone's true (i.e., unbiased, error-free) score. If we could then, in theory, the field of psychometrics would be pointless and there would be no need for measurement. But what does this have to do with signal detection theory and tradeoffs?
Given we have imperfect, error-prone measures, we will never achieve a perfect hit rate (or perfect correct rejection rate). Every test produces mistakes. Where the criterion is set therefore determines the relative numbers of false positives and false negatives.
Here's the tradeoff: False positives and false negatives are inversely related. As one increases, the other decreases. Put differently, a more conservative, stringent threshold yields fewer false positives but more false negatives. Conversely, a more liberal, looser threshold yields fewer false negatives but more false positives.
Take the Lyme disease example here. Let's assume that we have two populations, one of which does not have Lyme disease (no signal present, the curve on the left) and another that does (signal present, the curve on the right). Let's also assume that levels of Lyme disease antibodies follow a normal distribution.
The Centers for Disease Control and Prevention (CDC) has established criteria for interpreting test results to determine if Lyme disease is present. As seen in this excellent demonstration at WolframAlpha, if a criterion is too stringent, we will only be catching a small proportion of cases that actually have Lyme disease. In doing so, we have very few false positives as only those individuals with the highest levels of antibodies will be diagnosed. At the same time, we are excluding those individuals who may have the disease but do not have a high level of antibodies. Conversely, if a criterion is too loose, we will have very few false negatives as many people will be diagnosed as having Lyme disease. However, we will have many false positives given that many of those diagnosed do not actually have the disease. As the CDC admits, false positives and negatives for Lyme disease tests, like any other test (such as that for Epstein-Barr or influenza), are possible.
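For readers who would like to see the sliding criterion in numbers, here is a minimal sketch using the two-normal-curves assumption from above. The means, standard deviations and criterion values are invented for illustration (arbitrary units) and have nothing to do with the CDC's actual criteria.

```python
from statistics import NormalDist

# Hypothetical antibody-level distributions in arbitrary units.
# These parameters are assumptions for illustration, not real lab values.
healthy = NormalDist(mu=0.0, sigma=1.0)   # no signal: the curve on the left
infected = NormalDist(mu=2.0, sigma=1.0)  # signal present: the curve on the right


def error_rates(criterion: float):
    """Return (false positive rate, false negative rate) for a given cutoff.

    Anyone at or above the criterion is diagnosed as having the disease.
    """
    false_positive = 1.0 - healthy.cdf(criterion)  # healthy people above the cutoff
    false_negative = infected.cdf(criterion)       # infected people below the cutoff
    return false_positive, false_negative


# Sweep from a loose criterion to a stringent one.
for criterion in (0.0, 0.5, 1.0, 1.5, 2.0):
    fp, fn = error_rates(criterion)
    print(f"criterion={criterion:.1f}  false positives={fp:.3f}  false negatives={fn:.3f}")
```

As the criterion moves to the right, the false positive column shrinks while the false negative column grows. No cutoff drives both to zero as long as the two curves overlap.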
As mentioned previously, the key question that must be answered is: Which is worse?
This question is not to be answered lightly.
Is it worse to (a) tell someone who is disease-free that they now have a potentially debilitating illness (false positive) or (b) tell someone who has the disease that they are disease-free (false negative)? Remember, as rates of one increase, rates of the other decrease. In the case of a false positive, a person needlessly experiences increased stress, gets treatment and taxes the resources of the health-care system that would be better spent helping a truly sick person. In the case of a false negative, a person continues to get ill and may die, but the health-care system saves resources. These are tough decisions that are for the ethicists, health-care providers and politicians to make.
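One way to make "which is worse?" concrete is to attach a cost to each kind of error and see where the criterion that minimizes the expected cost ends up. The sketch below does this under loud assumptions: the two normal curves, the prevalence and the cost values are all hypothetical, and choosing real costs is exactly the ethical question raised above.

```python
from statistics import NormalDist

# Hypothetical antibody-level distributions (arbitrary units), as assumed above.
healthy = NormalDist(mu=0.0, sigma=1.0)
infected = NormalDist(mu=2.0, sigma=1.0)


def expected_cost(criterion, prevalence, cost_fp, cost_fn):
    """Average cost per person tested when diagnosing at or above the criterion."""
    p_false_positive = (1.0 - prevalence) * (1.0 - healthy.cdf(criterion))
    p_false_negative = prevalence * infected.cdf(criterion)
    return cost_fp * p_false_positive + cost_fn * p_false_negative


def best_criterion(prevalence, cost_fp, cost_fn):
    """Grid search over cutoffs from -3.00 to 5.00 in steps of 0.01."""
    candidates = [i / 100 for i in range(-300, 501)]
    return min(candidates, key=lambda c: expected_cost(c, prevalence, cost_fp, cost_fn))


# Made-up scenarios: equal costs, misses ten times worse, false alarms ten times worse.
for cost_fp, cost_fn in ((1, 1), (1, 10), (10, 1)):
    cutoff = best_criterion(prevalence=0.5, cost_fp=cost_fp, cost_fn=cost_fn)
    print(f"cost_fp={cost_fp}, cost_fn={cost_fn} -> criterion {cutoff:.2f}")
```

When misses are treated as more costly, the preferred criterion loosens (moves left); when false alarms are treated as more costly, it tightens (moves right). The code only shows how the answer follows from the costs; it says nothing about what the costs should be.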
There are, of course, many more applications of SDT that affect all of us at some point in our lives. Decisions made under uncertainty fall under the rubric of SDT. Think about when a decision must be made but the truth of the situation or the long-term impact of the decision is unknown. Such examples include:
In closing, let's go back to the article mentioned at the beginning. The inquiry found a false positive rate of 1.6%. It is difficult to know what to make of this number given that the hit rate and false negative rate were not listed, nor were typical false positive rates for other tests. One might infer from these omissions that the reporter felt that incorrectly diagnosing a nonexistent problem is of bigger concern to the public than incorrectly diagnosing an existing problem. It would have been more appropriate and judicious to present all of these values and allow us to determine exactly how problematic this false positive rate is. It is worth noting that the actual incidence rate of Lyme disease is very low, thankfully, although rates are on the rise.
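To see why the number is hard to interpret on its own, consider this rough sketch. The 1.6% figure is consistent with 24 incorrect results out of about 1,500 samples (the article says "more than 1,500", so this is approximate). An error rate in the SDT sense divides by the size of the relevant group instead: truly infected samples for false negatives, truly disease-free samples for false positives. The article does not report those group sizes, so the ones below are invented purely for illustration.

```python
def share_of_samples(errors: int, samples: int) -> float:
    """Errors as a share of everything that was tested."""
    return errors / samples


def rate_within_group(errors: int, group_size: int) -> float:
    """Errors as a share of the group in which that kind of error can occur."""
    return errors / group_size


errors, samples = 24, 1500  # from the article; 1,500 is a lower bound on the sample count
print(f"Share of all samples: {share_of_samples(errors, samples):.1%}")

# Hypothetical group sizes (NOT reported in the article).
for group_size in (50, 200, 1000):
    rate = rate_within_group(errors, group_size)
    print(f"If the relevant group had {group_size} samples, the error rate would be {rate:.1%}")
```

The same 24 mistakes could represent a small or a very large error rate depending on a denominator the article never gives.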
As detection methods become increasingly refined and accurate, false positive and false negative rates will drop, but they will never go to zero. Given this perpetual degree of inaccuracy, however slight, we are constantly torn by the tradeoff between misidentifying (a) desired outcomes as undesirable or (b) undesirable outcomes as desirable.
Further reading: