9 May 2011

It's been awfully quiet around here...

...And for that I do apologize. The last month has been very busy for me in a number of domains, but I do have a few posts up my sleeve that should be up soon.

At the very least, I'd like to mention Nathan Yau's excellent blog Flowing Data, which has some fantastic visualizations of data and statistics of every stripe. In particular, he recently mentioned an excellent (and free!) iPad app called Stats of the Union, which, as its name suggests, provides lots of solid, well-represented data on all sorts of key indicators in the USA. With some swipes and presses, your iPad can display such titbits as population density, the proportion of the population living below the poverty level and the levels of at-risk groups. The app is sophisticated enough to display these results county by county. Given its low, low price of $0, it's definitely worth downloading.

Americans may do so here. Canucks may do so here.

In other news, specifically news related to a previous post, Diesel Sweeties creator R. Stevens created this fantastic, Toronto Comic Arts Festival (TCAF)-only shirt celebrating a national pastime. It is now in my possession and I plan to sport it proudly about town and beyond.

I would also like to give a very public and heartfelt shout out to Ryan North, creator of the fantastic Dinosaur Comics and all-around nice guy. I volunteered at TCAF this past weekend and, during my shift, discovered that Ryan was almost out of books. I mentioned this to him when I happened to run into him. He then said he'd bring one to me, which he promptly did. This would have been nice enough, but what was even nicer was that he didn't charge me for it...he gave it to me as a thank you for volunteering! Super nice and funny -- Ryan's the whole package! You should totally read Dinosaur Comics, and not just the math-themed ones.

3 Apr 2011

On the non-panacea that is Bayesian statistics

I've linked to Andrew Gelman's must-read blog for a good reason: His posts are accessible, well-written and always informative. This quote, courtesy of Susan Ellenberg, from a recent post on issues with Bayesian statistics, says so much so succinctly:

You have to make a lot of assumptions in order to do any statistical test, and all of those are questionable.

Gelman's own point on the issues with a magical threshold that tests must cross to be considered significant, what he calls the "statistical significance filter", is also noteworthy:

As a result of this selection bias, statistically significant estimates tend to be overestimates--whether or not a Bayesian method is used, and whether or not there are any problems with fishing through the data. 

As the namesake article of this blog discussed so well so many years ago, we need to think more about what our results mean and how to present and interpret them. Simply relying on an arbitrary, preset standard just isn't good enough. Ironically, it seems like a rather anti-scientific thing to do.
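For the curious, Gelman's "statistical significance filter" can be seen in a small simulation. This is only a rough sketch under assumptions of my own: the true effect (0.2), the standard error (0.5) and the number of simulated studies are all made-up values, and each study's estimate is assumed to be normally distributed around the true effect.

```python
import random
from statistics import fmean

rng = random.Random(2011)  # fixed seed so the run is repeatable

TRUE_EFFECT = 0.2      # hypothetical true effect (an assumption)
STANDARD_ERROR = 0.5   # hypothetical standard error of each estimate (an assumption)
CRITICAL_Z = 1.96      # conventional two-sided .05 threshold

# Each simulated study yields one noisy estimate of the same true effect.
estimates = [rng.gauss(TRUE_EFFECT, STANDARD_ERROR) for _ in range(100_000)]

# Keep only the estimates that cross the significance threshold.
significant = [e for e in estimates if abs(e) / STANDARD_ERROR > CRITICAL_Z]

mean_all = fmean(estimates)
mean_abs_significant = fmean([abs(e) for e in significant])

print(f"True effect:                         {TRUE_EFFECT:.2f}")
print(f"Mean of all estimates:               {mean_all:.2f}")
print(f"Mean size of significant estimates:  {mean_abs_significant:.2f}")
```

Under these assumptions, an estimate has to be larger than 0.98 in absolute value to count as significant, so every estimate that survives the filter is necessarily several times larger than the true effect of 0.2.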

The promise of Bayesian statistics in social science has loomed large for some time. Certainly, Gelman, himself an expert on the Bayesian approach, has blogged extensively on this topic. I appreciate his reasoned argument that the Bayesian approach, like any other approach, is not a panacea for the ills of null-hypothesis significance testing, let alone an alchemical approach that spins straw into gold.

I do think, however, that the Bayesian approach should be covered in greater detail in social science statistics courses. While a grad student, I recall seeing an impassioned talk by decision-making expert Gerd Gigerenzer about how easy it is to teach and understand Bayes' theorem, despite its aura of apparent difficulty. It seems a shame that his advice appears to have gone unheeded, although I am open to hearing evidence to the contrary.

It's always worth having more tools in one's toolbox. Knowing how and when and why to use those tools is the key.
21 Mar 2011

Assorted comic thingies

Some statistics-themed levity for this first, rainy, Monday of spring, courtesy of the excellent PHD Comics.

In no particular order...

Reread your college stats notes, news media: "63% of internet readers will like this comic"

phd012010s.gif


What happens after years in grad school: "Your Math Skills"
phd081310s.gif

I'm a bigger fan of regression, but ANOVA has its uses too: "Analysis of Value"

phd082707s.gif


Bonus laugh: The very funny John Oliver from the very funny "Community" discusses The Duncan Principle. His outburst and anger at the results are not only all too familiar but also produced one of the funniest, stats-related curses I've ever heard:

Enjoy!

16 Mar 2011

"Roll Up The Rim" and the binomial distribution

It's that time again in Canada, where we frequent our myriad hockey-themed coffee bars in the hope of winning fabulous prizes, like cars, barbeques, and free coffee / donuts. I was particularly interested in tracking how, where and when folks were winning or losing.

A cursory Google search of "Roll up the rim statistics" led me to this blog from Jon Lin, who is keeping track of his wins and losses from this year's contest, just like he has done for previous years. He is keeping a live Google Docs spreadsheet of his progress, including such variables as what he ordered (seems to prefer Steeped Tea) and the size of the beverage (consistently large), all of which I think is a great idea. Certainly an interesting and fun application of the data-driven life.

Even better, Jon has a very accessible post on understanding the binomial probability distribution (it's not just for coin flips anymore!) as it pertains to wins and losses in Roll Up the Rim. By using a simple Excel function, you can plot out a distribution of the likelihood of a certain number of wins given the stated odds, as well as the cumulative probabilities associated with these outcomes. You can also determine how likely your particular performance is.

This year, I have had one win out of six coffees (all large, double-doubles with milk and sweetener, FWIW). The stated odds this year are, in fact, 1 in 6. Using the binomial distribution function, I discovered that my case has a 40.2% likelihood of occurring. Moreover, the probability of this case or worse (i.e., one win or none at all) occurring is a rather large 73.7%. I guess I should be happy with my one win, then...
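If you would rather check these numbers outside of Excel, here is a minimal Python sketch of the same calculation. The function and variable names are my own, and the sketch assumes each cup is an independent trial with a 1-in-6 chance of winning.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k wins in n independent cups, each won with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binomial_cdf(k: int, n: int, p: float) -> float:
    """Probability of k or fewer wins in n cups."""
    return sum(binomial_pmf(i, n, p) for i in range(k + 1))

n_cups = 6
p_win = 1 / 6  # the stated odds

exactly_one = binomial_pmf(1, n_cups, p_win)
one_or_fewer = binomial_cdf(1, n_cups, p_win)

print(f"P(exactly 1 win in 6 cups)  = {exactly_one:.1%}")   # 40.2%
print(f"P(1 win or fewer in 6 cups) = {one_or_fewer:.1%}")  # 73.7%
```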

From a statistical and psychological perspective, this annual contest is an interesting case of the Gambler's Fallacy, where, following a string of losses, one incorrectly expects future trials to produce success. Think about it: You've seen the hated "Please play again" so many times, you think "I'm bound to win something on my next cup!" only to, in all likelihood, find yourself poorer and disappointed (though likely caffeinated) yet again.
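To see why the "I'm bound to win" feeling is misplaced, here is a quick simulation sketch. It assumes, for simplicity, that every cup is an independent draw with a 1-in-6 chance of winning; the streak length of five and the number of simulated cups are arbitrary choices of mine.

```python
import random

rng = random.Random(6)  # fixed seed so the run is repeatable

P_WIN = 1 / 6   # the stated odds, assumed identical and independent for every cup
STREAK = 5      # length of the losing streak we condition on (arbitrary)

# True means a winning cup, False means "Please play again".
cups = [rng.random() < P_WIN for _ in range(300_000)]

# Outcomes of every cup that immediately follows STREAK straight losses.
after_streak = [
    cups[i]
    for i in range(STREAK, len(cups))
    if not any(cups[i - STREAK:i])
]

overall_win_rate = sum(cups) / len(cups)
win_rate_after_streak = sum(after_streak) / len(after_streak)

print(f"Overall win rate:            {overall_win_rate:.3f}")
print(f"Win rate after {STREAK} losses:     {win_rate_after_streak:.3f}")
```

Under the independence assumption, both rates hover around 1/6: a run of losses does nothing to improve the odds on the next cup.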

Rather than having each Tim Horton's outlet list the number of winners in a given store, it would be more informative to list the number of winners relative to the number of cups sold (i.e., winners as a proportion of winners and losers). Of course, I understand why Tim's doesn't list this information, but it is statistically irksome nonetheless.

With all this said, I'm doing a run...can I get you anything?

10 Mar 2011

Differentiating between Type I and Type II Errors

Type I and Type II Errors

A well-known social scientist once confessed to me that, after decades of doing social research, he still couldn't remember the difference between Type I and Type II errors. Since I suspect that many others also share this problem, I thought I would share a mnemonic I learned from a statistics professor. Recall that a Type I error occurs when the null hypothesis is rejected when it is in fact true, while a Type II error occurs when a null hypothesis is not rejected when it is actually false. This distinction, of course, many people find difficult to remember.

So here's the mnemonic: first, a Type I error can be viewed as a "false alarm" while a Type II error as a "missed detection"; second, note that the phrase "false alarm" has fewer letters than "missed detection," and analogously the numeral 1 (for Type I error) is smaller than 2 (for Type II error). Since learning this mnemonic, I have not forgotten the difference between Type I and Type II errors!

Glad to see I'm not the only one who (1) needed some help in remembering the difference between Type I and Type II errors, and (2) also used the wonder of signal detection theory to help remember the difference.

This reminds me of a trick my grad stats TA taught me for remembering the difference between positive and negative skew, which I find are among the most important yet consistently misunderstood definitions:

Think of the number line, running horizontally from negative infinity to positive infinity. If the tail of the distribution is to the left, or the negative side of the number line, then you have negative skew. Similarly, if the tail is to the right, or the positive side of the number line, then you have positive skew. Ta da!

Remember: You are interested in where the tail is, not where the bulk of the distribution lies. I think part of the confusion comes from our (reasonable) tendency to focus on the latter rather than the former. With skewness, it's all about the tail.
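The tail rule is easy to check numerically. The sketch below uses the usual moment-based skewness formula (third central moment divided by the second central moment raised to the power 1.5) on two tiny made-up samples; the data and the function name are purely illustrative.

```python
from statistics import fmean

def sample_skewness(values):
    """Moment-based skewness: third central moment over the second central moment ** 1.5."""
    mean = fmean(values)
    m2 = fmean([(x - mean) ** 2 for x in values])
    m3 = fmean([(x - mean) ** 3 for x in values])
    return m3 / m2 ** 1.5

# Made-up data: most values sit near 3, with one straggler far out in the tail.
right_tailed = [1, 2, 2, 3, 3, 3, 4, 4, 10]    # tail toward positive infinity
left_tailed = [-x for x in right_tailed]       # mirror image: tail toward negative infinity

print(f"Tail to the right: skewness = {sample_skewness(right_tailed):+.2f}")
print(f"Tail to the left:  skewness = {sample_skewness(left_tailed):+.2f}")
```

The sign of the result follows the tail: positive for the right-tailed sample, negative for its mirror image.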

9 Mar 2011

What's the most money you can win on Jeopardy!?


OK, this is pretty awesome. It's a beautiful mix of statistics and trivia! I can only hope that this article goes meta and this question becomes a clue on Jeopardy!

9 Mar 2011

On sex and happiness

Now that I've got your attention...

An interesting piece from the consistently interesting Salon today, discussing the link between increased sexual partners and depression. Without giving anything away (ha ha), the article does a reasonably good job of presenting the evidence, what little there is, for and against the proposition. What struck me most was this paragraph, which contains quotes from economist Andrew Oswald (highlight added by me):

Moving beyond that common-sense interpretation, though: "As a statistician and behavioral scientist, there is no compelling reason to think that larger numbers of sexual partners are truly 'causing' less happiness," he says. It's more likely that the reverse is true. "I find Ms. Right; she makes me happy; I then don't need to look for any other sexual partners," he says. "Until I find Ms. Right, it is quite rational to have plenty of sexual partners, and as a bonus it's fun along the way." The problem is that it's hard to conduct a solid lab experiment on such a thing -- "it is not like testing a new sleeping pill against a placebo," he says. "Until the U.S. National Science Foundation allows us to randomly assign sexual partners to people in lab experiments, and if that happens I certainly have some undergraduate students who would happily sign up to that experimental roster, we are going to have to be super cautious before reading too much causality into the (interesting) correlations that we see in data."

This reasoned, cautious approach is, to my mind, very refreshing to read. A good scientist should, except in very rare occurrences or after an overabundance of research validates a position, take the position of "I / we don't know." It is better to be hesitant and admit that more work is needed or that the conclusions are tentative than to take a more hard-line, almost dogmatic approach and say that "This is the way it is." I imagine (and hope) that, for most scientists, this is their philosophy. Too often, though, the presentation of science in the media implies the latter. This rush to present the findings from ongoing research as representing the final word on a topic, or at least the consensus of (the majority of) the scientific community, is presumptuous at best and irresponsible or dangerous at worst.

In the current case, however, the highlighted quote from Oswald and the similarly nuanced and cautious conclusion reached in the article seem oddly jarring. Dr. Oswald's point that the only way to conclusively determine a causal relationship is to randomly assign participants to experimental conditions of either many or few sexual partners is a crucial one. Just because a history of more sexual partners was associated with higher rates of depression -- and, of course, a history of fewer sexual partners was associated with lower rates of depression -- does not mean that being more promiscuous led one to report greater negative affect, or vice versa. Of course, much like the famous study proposed by John Watson, the study proposed by Dr. Oswald would never be conducted due to ethical issues, let alone an inability to obtain funding.

In short, I applaud Dr. Oswald for making this point so clearly and succinctly, and the article's author, Tracy Clark-Flory, and the editors at Salon for including it in the posted article. I can only hope that other scientists and, certainly, other media outlets follow this example.

1 Mar 2011

Lyme disease, syphilis and...false positives?

In the spirit of Carnac the Magnificent, the title of this post could be subtitled: Name three things you don't want.


The other day, Google News showed me this article about people being misdiagnosed with Lyme disease. From the article:

The Public Health Agency of Canada says a review of its Lyme disease testing methods has turned up 24 patients in five provinces who received false-negative test results.

The testing was done at the agency's National Microbiology Laboratory in Winnipeg. The review, a quality control analysis, looked at more than 1,500 samples tested for Lyme disease and discovered the 24 incorrect test results last month.

A similarly themed article appeared in the New York Times about the guidelines for detecting prostate cancer and the apparently many needless biopsies that result. These articles made me recall a story from my former graduate statistics professor who told us that upon applying for Canadian citizenship, he and his wife were repeatedly flagged for having syphilis. He assured us that, contrary to what the tests indicated, neither of them had the disease. That is, their tests repeatedly came up as false positives.

I thought this would be a useful opportunity to discuss a topic with widespread implications, many of which are not immediately obvious to those just learning statistics: Signal Detection Theory, or SDT for short. Most psychology students hear about SDT when learning about Type I and Type II errors, which I will discuss later in this post, although the connection between the two topics is not always made; see here for a welcome exception.

As its name suggests, signal detection theory is concerned with determining whether a desired stimulus - the signal - is present in a given scenario. Some examples will hopefully make this idea clearer. When reading these examples, consider them in the context of determining whether or not the detection mechanisms or tests are operating correctly.

Imagine a soldier examining a radar screen looking for signs of enemy aircraft. Delivering the right information in this job is of vital importance in determining whether or not the country goes to war. For every sweep of the radar, the soldier is in one quadrant of this 2 x 2 matrix:

 


                        What the soldier sees
Type of aircraft        Blip                     No blip
Enemy aircraft          To war!                  Big problem
Friendly aircraft       Egg on soldier's face    Keep calm and carry on

Imagine the soldier sees a blip:

  • If the blip corresponds to an enemy aircraft, the soldier warns the general and, in turn, the country goes to war
  • If the blip on the radar corresponds to a friendly aircraft, which suggests the radar is malfunctioning, the soldier looks foolish and has wasted the general's time. 

In one scenario, there is a huge cost of lives and resources, but the soldier did what was required and expected of him/her in this situation. In the other scenario, the soldier looks foolish and, at worst, appears incompetent in the general's eyes. Which of these is worse? 

Let's take this discussion one step further: Imagine the soldier does not see a blip, which means that s/he will not speak to the general.

  • If no blip was spotted, but there is, in fact, an enemy aircraft, the radar may be malfunctioning and the country becomes the victim of an ambush
  • If no blip was spotted and a friendly aircraft passed through the radar space, then the radar is performing properly, the country stays at peace and the soldier waits for the next sweep of the radar. 

Here, the question of "which is worse?" is easier to answer, as the no blip / friendly aircraft scenario is more desirable. In truth, however, all four quadrants need to be considered when answering the question of "which is worse?" I will come back to this idea soon.

I chose the radar example for a number of reasons, not least of which is out of historical deference to the formative work done in signal detection theory. It is easy to see the importance of making the correct decision in the context of a soldier scanning a radar screen. This same logic applies to the test for lyme disease that prompted this post.

Imagine you are a schoolteacher, one Miss Elizabeth Hoover, awaiting the results of your Lyme disease test. In this example, the above grid now looks like this:


                              Test results
Does Miss Hoover actually     Positive                        Negative
have Lyme disease?            (indicates Lyme disease)        (indicates absence of Lyme disease)

Yes                           Take necessary sick leave:      I'm sick but don't know why...
                              Stay home, call substitute,
                              seek treatment

No                            Take unnecessary sick leave:    Back to work...
                              Stay home, call substitute,     perhaps symptoms were all in her head
                              waste time and resources


Again, the outcome is of major importance. Like the radar example above, the implications have life or death consequences:

  • If the test indicates the presence of lyme disease and Miss Hoover actually has lyme disease, she will be sick for a time but will receive treatment and live.
  • If the test indicates Miss Hoover has lyme disease but she is, in truth, disease-free, she will have taken unnecessary sick leave from work and wasted the time and resources of the health-care system. 
  • If the test is negative but Miss Hoover really does have lyme disease, she will get increasingly sick and eventually die. The doctor who administered and misinterpreted the results could also be sued for malpractice.
  • Finally, if the test is negative and Miss Hoover is free of lyme disease, which is clearly the best case scenario, then she goes back to work.

As with the radar example, the negative test-no Lyme disease quadrant is best. However, also as with the radar example, we need to consider all four quadrants together. But why? The answer lies in one key element of signal detection theory: Tradeoffs.

The question we must answer in SDT is this: Are we more willing to accept (a) detecting a fake, non-existent signal or (b) failing to detect a real, existing signal? That is, are the consequences worse for (a) claiming to find something that is not actually there or (b) ignoring something that is actually there? In the language of signal detection, we are asking whether we are more tolerant of (a) false positives (or Type I errors) or (b) false negatives (or Type II errors):


                     Decision
Reality / Truth      Signal present                    Signal absent
Yes                  Hit                               False negative / Miss /
                                                       Omission / Type II Error
No                   False positive / False alarm /    Correct Rejection
                     Commission / Type I error

Think about the above examples in the context of this matrix:

  • If our test indicates a signal is present and the signal is actually present, we count that as a hit. That is, the strength of the signal was greater than that of the noise.
  • If our test indicates a signal is present but there is no actual signal, we count that as a false positive, also called a false alarm or commission error. In statistical hypothesis testing, these errors are Type I errors: We claim to have found a significant difference when, in fact, this difference occurred by chance.
  • If our test fails to detect an actual signal, we count that as a false negative, also called a miss or omission error. In statistical hypothesis testing, these are Type II errors: We report that no significant difference was found when, in fact, a real difference exists.
  • If our test correctly indicates that no signal is present, we count that as a correct rejection.
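The four outcomes above can be sketched in a few lines of Python. The trial counts below are invented purely for illustration, and the function name is my own.

```python
from collections import Counter

def classify(signal_present: bool, test_says_present: bool) -> str:
    """Map one trial onto a cell of the 2 x 2 signal detection matrix."""
    if signal_present and test_says_present:
        return "hit"
    if signal_present and not test_says_present:
        return "false negative"    # miss / omission / Type II error
    if not signal_present and test_says_present:
        return "false positive"    # false alarm / commission / Type I error
    return "correct rejection"

# Invented data: (signal actually present?, test says present?)
trials = (
    [(True, True)] * 8       # 8 hits
    + [(True, False)] * 2    # 2 misses
    + [(False, True)] * 3    # 3 false alarms
    + [(False, False)] * 17  # 17 correct rejections
)

counts = Counter(classify(truth, decision) for truth, decision in trials)

# Rates are computed within each row of the matrix (i.e., given the truth).
hit_rate = counts["hit"] / (counts["hit"] + counts["false negative"])
false_alarm_rate = counts["false positive"] / (
    counts["false positive"] + counts["correct rejection"]
)

print(dict(counts))
print(f"Hit rate: {hit_rate:.0%}, false alarm rate: {false_alarm_rate:.0%}")
```

Note that the hit rate and the false negative rate sum to 1, as do the false alarm rate and the correct rejection rate, which is why reporting one number from each row is enough.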

In any given scenario, a decision must be made whether a signal is present or absent. That is, a particular threshold or criterion must be used to make this decision. In the case of the radar, the proximity, shape and flight pattern of an aircraft factor into whether the radar registers a blip or not. In the case of the lyme disease test, the number of antibodies in the blood functions in a similar way. 

As mentioned in a previous post, and as is well established in psychometrics, especially classical test theory, there is no test that is perfectly reliable. In turn, there is no test that is perfectly valid. Taken together, these points imply that we cannot know someone's true (i.e., unbiased, error-free) score. If we could, then, in theory, the field of psychometrics would be pointless and there would be no need for measurement. But what does this have to do with signal detection theory and tradeoffs?

Given that we have imperfect, error-prone measures, we will never achieve a perfect hit rate (or perfect correct rejection rate). Every test produces mistakes. Therefore, where the criterion is set determines the relative numbers of false positives and false negatives.

 

Here's the tradeoff: False positives and false negatives are inversely related, so as one increases, the other decreases. Put differently, a more conservative, stringent threshold yields fewer false positives but more false negatives. Conversely, a more liberal, looser threshold yields fewer false negatives but more false positives.

 

Take the lyme disease example here. Let's assume that we have two populations, one of which does not have lyme disease (no signal present, the curve on the left) and another that does (signal present, the curve on the right). Let's also assume that levels of lyme disease antibodies follow a normal distribution.

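[Figure: two overlapping normal curves of antibody levels, with the no-signal population on the left and the signal-present population on the right]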

The Centers for Disease Control (CDC) has established criteria for interpreting test results to determine if lyme disease is present. As seen in this excellent demonstration at WolframAlpha, if a criterion is too stringent, we will only be catching a small proportion of cases that actually have lyme disease. In doing so, we have very few false positives as only those individuals with the highest amount of antibodies will be diagnosed. At the same time, we are excluding those individuals who may have the disease but do not have a high level of antibodies. Conversely, if a criterion is too loose, we will have very few false negatives as many people will be diagnosed as having lyme disease. However, we will have many false positives given that many of those diagnosed do not actually have the disease. As the CDC admits, false positives and negatives for lyme disease tests, like any other test (such as that for Epstein-Barr or influenza), are possible.
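Here is a rough sketch of that tradeoff in Python. The two normal distributions and all of their parameters are invented for illustration (they are not real antibody levels or CDC criteria); the only point is to watch the two error rates move in opposite directions as the criterion slides.

```python
from statistics import NormalDist

# Hypothetical antibody-level distributions, in arbitrary units (assumptions, not real data).
healthy = NormalDist(mu=0.0, sigma=1.0)  # no signal present (left curve)
sick = NormalDist(mu=2.0, sigma=1.0)     # signal present (right curve)

def error_rates(criterion: float) -> tuple[float, float]:
    """Return (false positive rate, false negative rate) when scores above the criterion test positive."""
    false_positive_rate = 1 - healthy.cdf(criterion)  # healthy people above the cutoff
    false_negative_rate = sick.cdf(criterion)         # sick people at or below the cutoff
    return false_positive_rate, false_negative_rate

print("criterion  false positives  false negatives")
for criterion in (0.0, 0.5, 1.0, 1.5, 2.0):
    fp, fn = error_rates(criterion)
    print(f"{criterion:9.1f}  {fp:15.1%}  {fn:15.1%}")
```

Moving the criterion to the right (more stringent) shrinks the false positive column while the false negative column grows, and vice versa. No placement of the criterion drives both to zero as long as the curves overlap.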

As mentioned previously, the key question that must be answered is: Which is worse? 

This question is not to be answered lightly.

Is it worse to (a) tell someone who is disease-free that they now have a potentially debilitating illness (false positive) or (b) tell someone who has the disease that they are disease-free (false negative)? Remember, as rates of one increase, rates of the other decrease. In the case of a false positive, a person needlessly experiences increased stress, gets unnecessary treatment and taxes the resources of the health care system that would be better spent helping a truly sick person. In the case of a false negative, a person continues to get ill and may die, but the health care system saves resources. These are tough decisions that are for the ethicists, health-care providers and politicians to make.

There are, of course, many more applications of SDT that affect all of us at some point in our lives. Decisions made under uncertainty fall under the rubric of SDT. Think about when a decision must be made but the truth of the situation or the long-term impact of the decision is unknown. Such examples include:

In closing, let's go back to the article mentioned at the beginning. The inquiry found a false positive rate of 1.6%. It is difficult to know what to make of this number given that the hit rate and false negative rate were not listed, nor were typical false positive rates for other tests. One might infer from these omissions that the reporter felt that incorrectly diagnosing a nonexistent problem is of bigger concern to the public than incorrectly diagnosing an existing problem. It would have been more appropriate and judicious to present all of these values and allow us to determine exactly how problematic this false positive rate is. It is worth noting that the actual incidence rate of Lyme disease is very low, thankfully, although rates are on the rise.
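To illustrate why the missing values matter, here is a small sketch that combines a false positive rate with a hit rate and an incidence rate to ask: of the people who test positive, what proportion actually have the disease? Only the 1.6% figure comes from the discussion above; the hit rate and the incidence rate are placeholders I made up, precisely because the article did not report them.

```python
def positive_predictive_value(base_rate: float, hit_rate: float, false_positive_rate: float) -> float:
    """Proportion of positive test results that come from people who really have the disease."""
    true_positives = base_rate * hit_rate
    false_positives = (1 - base_rate) * false_positive_rate
    return true_positives / (true_positives + false_positives)

FALSE_POSITIVE_RATE = 0.016  # the 1.6% figure discussed above
HIT_RATE = 0.90              # hypothetical: not reported in the article
INCIDENCE = 0.001            # hypothetical: a deliberately low made-up incidence rate

ppv = positive_predictive_value(INCIDENCE, HIT_RATE, FALSE_POSITIVE_RATE)
print(f"Share of positive results that are real cases: {ppv:.1%}")

# The same false positive rate looks very different if the disease is common.
ppv_common = positive_predictive_value(0.20, HIT_RATE, FALSE_POSITIVE_RATE)
print(f"Same test, hypothetical 20% incidence:         {ppv_common:.1%}")
```

With these made-up inputs, the same 1.6% is either alarming or negligible depending on how rare the disease is, which is why a false positive rate reported on its own is so hard to interpret.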

As detection methods become increasingly refined and accurate, false positive and false negative rates will drop, but they will never go to zero. Given this perpetual degree of inaccuracy, however slight, we are constantly torn by the tradeoff between misidentifying (a) desired outcomes as undesirable or (b) undesirable outcomes as desirable. 


1 Mar 2011

Richard Florida's "Can data predict political revolutions?"

Richard Florida has a piece in The Atlantic about the extent to which data can predict revolutions. Following the lead set by The Economist and using a slew of variables, including life satisfaction, GDP per capita and percent of the workforce in the creative class, he and his colleague Charlotta Mellander produced the "Index of Political Unrest" (IPU). They go on to say that IPU scores are high in many Middle Eastern countries and regions currently experiencing political upheaval, including Egypt (.74) and Yemen (.75).

However, as Florida points out, the index appears to be "fallible". Scores of other countries undergoing similar changes, such as Libya (.51) and Bahrain (.30), are lower than might be expected. Similarly, the highest scores are found in countries that are not currently experiencing revolutions, namely Togo (.93), Mongolia (.83) and Armenia (.81).

Although I applaud Florida's data-based approach, his results seem problematic. I appreciate that he admits the shortcoming of the IPU index, but it seems that he should have fine-tuned the measure further. That is, there appear to be important variables that are not included in the IPU that contribute to unrest in a particular country. For example:

  • What impact does the outbreak of revolution in one country in a given region have on the likelihood of revolution in another, nearby country?
  • What is the role of speeches by the leader and by those leading the rebels before and during a revolution? Peter Suedfeld and his colleagues / graduate students have written extensively on the role of integrative complexity in predicting wars and uprisings (see here for a recent example).

In addition, Florida excludes any discussion of the statistical properties of his index: Although selected scores and their rankings are listed, the distribution of scores and its skew, mean and standard deviation are not provided. These last two values would have been of great use in determining whether or not the different countries are actually different from each other or from the average. For example, is the .10 difference between Bahrain (.30) and the US (.40) a meaningful one?
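To show why the standard deviation matters here, this sketch expresses that same .10 gap in standard deviation units under two different spreads. Both standard deviations are invented, since Florida's piece does not report one; only the two scores come from the example above.

```python
def standardized_gap(score_a: float, score_b: float, sd: float) -> float:
    """Difference between two index scores, expressed in standard deviation units."""
    return (score_b - score_a) / sd

BAHRAIN, US = 0.30, 0.40  # the two scores from the example above

# Two hypothetical standard deviations for the index (assumptions, not reported values).
for hypothetical_sd in (0.05, 0.25):
    gap = standardized_gap(BAHRAIN, US, hypothetical_sd)
    print(f"SD = {hypothetical_sd:.2f}: the .10 gap is {gap:.1f} standard deviations")
```

If scores are tightly bunched, a .10 gap is large; if they are widely spread, the same gap is small. Without the actual standard deviation, there is no way to tell which situation applies.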

Florida says he will get into more detail and unravel this index further in a subsequent post. I find Florida's work interesting and insightful and I look forward to reading this later post, with the hope that these additional details will be provided.

27 Feb 2011

From McSweeney's: The Unedited Version Of 'The Social Network' Movie Poster.


This is worth a laugh and a read: The always excellent McSweeney's deconstructs the poster for "The Social Network". Great for psychology and statistics nerds (i.e., me and my readers :P )

Craig Nathanson's Space

I have a PhD in personality psychology / measurement and a continuing hunger for knowledge. I am particularly interested in the powerful applications of science, data and statistics in our everyday lives. The often inaccurate representation of these areas is a constant thorn in my side.