(aka The Myth of Statistical Significance)

Get the Nobel Prize ready: despite having no medical training and having run no tests, I know for a fact that Baskin-Robbins cures cancer.

How can I be so certain?

Because they have 31 flavors.

If I ran a test of people who eat each flavor of Baskin-Robbins, it’s very likely that at least one group would show statistically significant results at the .05 level that its flavor cures cancer. All a “.05 level” means is that there’s a 5% or less chance that a result that extreme would have happened by random chance alone. Given 31 shots at a one-out-of-twenty proposition, one or more is likely to come up with significant positive results.

(Before you rush out to buy a franchise, know there’s also likely a flavor or two that appears to cause cancer at the .05 level.)
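The back-of-the-envelope math is easy to check. With 31 independent tests of a true null hypothesis, each with a 5% false-positive rate, the chance that at least one comes up “significant” is:

```python
# Probability that at least one of 31 independent null tests
# looks "significant" at the .05 level purely by chance.
alpha = 0.05
flavors = 31
p_at_least_one = 1 - (1 - alpha) ** flavors
print(f"{p_at_least_one:.1%}")  # prints 79.6%
```

So roughly four times out of five, at least one flavor “wins” even though none of them does anything.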

Why do I tell this story, delicious as it may be?

Because I have lived this story. Not with ice cream, but with direct mail.

I set up a 15-panel test, with only slightly modified results below:

You might say:

  • E2 is a strong winner
  • C had a strong showing as a concept
  • Don’t do what you did in A again – it had three of the worst six showings

Here’s the trick: all these panels were the same audience receiving the same piece. There was no test. It was my first time doing a list pull, so I had used the same instructions as the previous year, when there was a 15-part test.

The actual test had about the same differences in results as the fake one.
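You can simulate this non-test yourself: give 15 identical panels the same true response rate and watch the spread that pure chance produces. The 1.5% response rate and 5,000-name panels below are illustrative assumptions, not the actual campaign’s numbers.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# 15 "panels" that are actually the same audience getting the same
# piece: identical true response rate, no real differences at all.
true_rate, n = 0.015, 5000
panels = [sum(random.random() < true_rate for _ in range(n)) / n
          for _ in range(15)]

print(f"best panel: {max(panels):.2%}, worst panel: {min(panels):.2%}")
```

Run it a few times with different seeds and you’ll reliably get a “strong winner” and a “clear loser,” which is exactly the table that fooled me.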

What does this mean for your testing regime?

First, let’s recognize that a .05 level of statistical significance is as artificial as, say, only finding guys attractive who are six-foot-plus.  (Why, yes, I am 5’11”.  Why do you ask?)  You should have barely a modicum more certainty in a .049 result than in a .051 one.

It also means that most tests don’t have winners or losers.  They’re largely noise from which it is difficult to extract a meaningful signal.  One reason is that most direct marketing tests don’t have large enough sample sizes to detect anything but very large effects.

I’d recommend playing around with the sample size calculator here to see the sample sizes necessary for adequate statistical power.
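To give a feel for the numbers (this is a rough sketch using the standard two-proportion z-test approximation, not the calculator linked above, and the 1.0% vs. 1.2% response rates are hypothetical):

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(p1, p2, alpha=0.05, power=0.8):
    """Approximate names needed per panel to detect a difference
    between response rates p1 and p2 with a two-sided z-test."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_b = NormalDist().inv_cdf(power)          # critical value for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_a + z_b) ** 2 * variance / (p1 - p2) ** 2)

# Detecting a lift from a 1.0% to a 1.2% response rate:
print(sample_size_per_arm(0.010, 0.012))
```

The answer is on the order of 40,000+ names per panel — far more than most nonprofit test panels actually get, which is why so many “results” are noise.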

This is also why it’s important to start with a strong hypothesis where testing is concerned.  You want a hypothesis like this: “I believe this test should [increase response rate/increase average gift/increase donor lifetime value/some combination] because of X.”  That way, you’ll know what you are going to measure against, instead of looking at a table like mine above and picking ‘winners’ based on wherever a bigger (and false, but alluringly so) number appears.

Your hypothesis should also be something that will have impacts beyond one communication.  If you test a red envelope versus a white envelope, you might get a statistically significant result at the end of the test.  But your results may be no different than if you had mailed the red envelope twice by mistake.

Moreover, you’ve only taken steps toward maximizing the value of that single communication, rather than the value of a donor.  It’s when you focus on how to treat donors and what causes them to do what they do that you enter the realm of strategy, build donor loyalty, and maybe, just maybe, make your donors happier with their experiences.

So ask yourself – am I testing a cancer drug that has a chance of working because I have a theory about how it will metabolize?

Or am I testing ice cream flavors hoping one comes up statistically significant?

The latter can be a very rocky road.

