How do you test the effectiveness of anti-spam tools? You feed thousands of messages, both spam and legitimate email, to these programs, then you carefully sift through everything that comes out the other end.
As part of our research for the article “Guard Your In-Box” (April 2003), we tapped into an archive of more than 250,000 spam messages from 1993 through 2002. Running all 250,000 messages through every anti-spam utility wasn’t practical, so we tested with a sample, and we weighted that sample toward recent spam because we decided it would be a better test than old spam. We selected three quarters of the test messages from spam collected between December 2001 and December 2002. The remaining quarter came from spam collected between November 1997 and December 2002.
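For readers who want to build a similar sample, here is a minimal Python sketch of a date-weighted draw. It is not the script we used. The function name `sample_by_period`, the `(received_date, text)` tuple layout, and the choice to draw the remaining quarter from the wider date window while skipping messages already picked are all assumptions made for illustration.

```python
import random
from datetime import date

def sample_by_period(messages, total, recent_start, wide_start, end,
                     recent_share=0.75, seed=0):
    """Pick `total` messages, `recent_share` of them from the recent window.

    `messages` is a list of (received_date, text) tuples (assumed layout).
    """
    rng = random.Random(seed)

    # Indexes of messages that fall inside the recent window.
    recent = [i for i, (received, _) in enumerate(messages)
              if recent_start <= received <= end]
    n_recent = round(total * recent_share)
    picked = set(rng.sample(recent, n_recent))

    # Fill the rest from the wider window, skipping anything already picked.
    wider = [i for i, (received, _) in enumerate(messages)
             if wide_start <= received <= end and i not in picked]
    picked.update(rng.sample(wider, total - n_recent))

    return [messages[i] for i in sorted(picked)]

# Example call using the date ranges described above (archive is hypothetical):
# test_set = sample_by_period(archive, 1000,
#                             recent_start=date(2001, 12, 1),
#                             wide_start=date(1997, 11, 1),
#                             end=date(2002, 12, 31))
```

`random.sample` raises `ValueError` when a window holds fewer messages than requested, which is a useful warning that the archive is too thin for the chosen set size.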
We wrote a script to select spam messages randomly. To weed out near-duplicate messages, the script also examined each message’s word order and frequency: When a candidate was too similar to a message already in the test set, the script discarded it and selected another.
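The idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the script we ran: the two similarity measures (cosine similarity of word counts for frequency, overlap of adjacent word pairs for order), the way they are averaged, and the 0.7 threshold are all hypothetical choices.

```python
import random
import re
from collections import Counter
from math import sqrt

def words(text):
    """Lowercase a message and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def frequency_similarity(a, b):
    """Cosine similarity of word counts (how alike the word frequencies are)."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())
    norm = (sqrt(sum(v * v for v in ca.values()))
            * sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0

def order_similarity(a, b):
    """Jaccard overlap of adjacent word pairs (how alike the word order is)."""
    pa, pb = set(zip(a, a[1:])), set(zip(b, b[1:]))
    union = pa | pb
    return len(pa & pb) / len(union) if union else 0.0

def select_messages(archive, size, threshold=0.7, seed=0):
    """Randomly pick up to `size` messages, skipping near-duplicates."""
    rng = random.Random(seed)
    candidates = list(archive)
    rng.shuffle(candidates)  # random selection order

    chosen, chosen_words = [], []
    for text in candidates:
        if len(chosen) == size:
            break
        w = words(text)
        # Discard the candidate if it is too similar to any kept message.
        if any((frequency_similarity(w, k) + order_similarity(w, k)) / 2
               >= threshold for k in chosen_words):
            continue
        chosen.append(text)
        chosen_words.append(w)
    return chosen
```

Each candidate is compared against every message already kept, so the work grows roughly with the square of the set size. That is tolerable for sets of a few thousand messages but slow for much larger ones.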
We created sets of 1,000, 3,000, and 5,000 messages, then two sets of 10,000 spam messages: one to train Bayesian filters, and another to test them. With each utility, we used the largest set of spam messages possible given time and bandwidth constraints (testing online services is considerably more difficult than testing desktop utilities).
For legitimate email, we used email received between November 1997 and December 2002. One-third of the messages in each test set were from friends, family, and acquaintances; one-third were from work-related email; and the final third were from mailing lists we subscribed to. As with the spam message sets, we made sets of 1,000, 3,000, and 5,000 messages, then two sets of 10,000 messages each.
To configure anti-spam programs that support whitelists or other processing exceptions for mailing lists and buddies, we gathered the addresses of every mailing list represented in each message set, as well as the email addresses of individuals who appeared ten or more times in each message set. Where applicable, we then entered these mailing lists and personal addresses into the anti-spam utilities before testing.
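Gathering those addresses is easy to automate. The Python sketch below shows one way to do it, assuming each message has already been reduced to a `(category, address_header)` pair, where the category `"list"` marks mailing-list mail and the header holds the list's own address. That layout and the name `build_whitelist` are hypothetical; `parseaddr` is the standard-library helper for pulling the bare address out of a header value.

```python
from collections import Counter
from email.utils import parseaddr

def build_whitelist(messages, min_count=10):
    """Return addresses to whitelist for one message set.

    `messages` is an iterable of (category, address_header) pairs
    (assumed layout); category "list" marks mailing-list mail.
    """
    list_addresses = set()
    person_counts = Counter()

    for category, address_header in messages:
        # parseaddr returns (display_name, address); keep the address only.
        address = parseaddr(address_header)[1].lower()
        if not address:
            continue
        if category == "list":
            list_addresses.add(address)      # every mailing list qualifies
        else:
            person_counts[address] += 1      # individuals must earn it

    # Individuals qualify only if they appear min_count or more times.
    frequent = {a for a, n in person_counts.items() if n >= min_count}
    return list_addresses | frequent
```

Lowercasing the address keeps `Alice@example.com` and `alice@example.com` from being counted as two different correspondents.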