Saturday, 29 April 2006

Visit SpamOrHam.org and assist anti-spam research

Last week, John Graham-Cumming launched SpamOrHam.org. If you're familiar with 'Hot or Not' you'll probably get the idea. As Graham-Cumming says:

The basic idea is to get humans (that means you) to read a small number of messages (some are ham; some are spam) and decide what they are. I'm doing this because there are currently two usable corpuses of spam and ham: the SpamAssassin Public Corpus (which was hand sorted) and the TREC 2005 Public Corpus (which was machine sorted) ... Once I've got enough human decisions (I'd love to get 10 per message; that means almost 1,000,000 human classifications) I'll make all the data public.

In other words, if you visit the site, you can vote on individual messages, saying whether you think each one is spam or legitimate. This voting will be very helpful to spam researchers, because an accurate "corpus" of spam and ham allows them to automatically test new anti-spam techniques. Graham-Cumming continues:

I'll highlight any emails where people disagree with the current classification published by Gordon Cormack ... I expect it'll throw up some interesting data... for example, just how good are humans are sorting spam? Since we'll be able to look at where the corpus and the humans disagree we'll be able to spot machine errors and human errors.
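To make the comparison concrete, here is a minimal Python sketch of how human votes might be checked against a corpus's published labels. It is not Graham-Cumming's code: the function names, the message IDs, the sample votes, and the simple majority rule are all assumptions made for illustration.

```python
from collections import Counter

def majority_label(votes):
    """Return the most common label among the votes, or None if empty or tied."""
    top = Counter(votes).most_common(2)
    if not top:
        return None
    if len(top) == 2 and top[0][1] == top[1][1]:
        return None  # no clear majority
    return top[0][0]

def find_disagreements(corpus_labels, human_votes):
    """List (message_id, corpus_label, human_label) where the human majority differs."""
    disagreements = []
    for msg_id, corpus_label in corpus_labels.items():
        human_label = majority_label(human_votes.get(msg_id, []))
        if human_label is not None and human_label != corpus_label:
            disagreements.append((msg_id, corpus_label, human_label))
    return disagreements

# Hypothetical data: three messages with ten votes each.
corpus_labels = {"msg-001": "spam", "msg-002": "ham", "msg-003": "spam"}
human_votes = {
    "msg-001": ["spam"] * 9 + ["ham"],      # humans agree with the corpus
    "msg-002": ["spam"] * 7 + ["ham"] * 3,  # humans disagree with the corpus
    "msg-003": ["spam"] * 5 + ["ham"] * 5,  # tie, so no human verdict
}

disagreements = find_disagreements(corpus_labels, human_votes)
```

A flagged message does not say who is wrong. It only marks a place where the machine label and the human majority differ, which is where someone would then look for machine errors or human errors.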

Friday, 28 April 2006

Tips for your new anti-spam idea

So you have a fantastic new idea to solve the spam problem once and for all? Of course, you're sure it'll work brilliantly and you're sure nobody else has thought of it.

Sounds like you've come up with what spam fighters call a FUSSP -- a Final Ultimate Solution to the Spam Problem. Vernon Schryver maintains a list of fallacies that appear again and again from FUSSP inventors. It's fairly impenetrable to those outside the spam-fighting clique (as some think of it). So here are a few rephrased highlights. Think of them as tips to prevent making yourself look foolish:

  • Don't assume that spammers are stupid.
  • Don't rely on email recipients changing their behavior with nothing to show for it.
  • Don't rely on other email senders responding to automatic challenges (or on the victims of challenges sent to forged addresses declining to respond).
  • Don't rely on all ISPs, web hosts, and registrars being active, responsible, spam-hating net citizens.
  • Don't propose replacing SMTP, DNS, TCP/IP, Microsoft Exchange, Lotus Notes/Domino, or other immovable objects.
  • Know what these terms mean: tarpit, DNSBL, HELO, EHLO, MX, RMX, MTA, MUA, DCC.
  • Know the difference between the SMTP envelope and header.
  • If your scheme requires a new standard, make sure you understand how standards are set on the Internet -- at a minimum, read and understand RFC 2223 and RFC 2026.
  • With few exceptions, strangers won't pay money to send you mail.
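
On the envelope-versus-header tip, a small Python sketch may help. The envelope is what the sending server states in the SMTP conversation (MAIL FROM and RCPT TO), while the headers are lines inside the message itself, and nothing forces the two to match. The helper name build_delivery and every address below are made up for illustration, and nothing is actually sent.

```python
from email.message import EmailMessage

def build_delivery(envelope_from, envelope_to, header_from, header_to, subject, body):
    """Bundle a message with the envelope addresses that would accompany it."""
    msg = EmailMessage()
    # Headers: part of the message data, shown to the reader by the mail client.
    msg["From"] = header_from
    msg["To"] = header_to
    msg["Subject"] = subject
    msg.set_content(body)
    # Envelope: given separately to the server, used for routing and bounces.
    return {"mail_from": envelope_from, "rcpt_to": list(envelope_to), "data": msg}

# Hypothetical addresses: the envelope and the headers deliberately differ.
delivery = build_delivery(
    envelope_from="bounces@example.net",
    envelope_to=["alice@example.org"],
    header_from="Newsletter <news@example.com>",
    header_to="subscribers@example.com",
    subject="April issue",
    body="Hello, this is the April issue.\n",
)

print("MAIL FROM:", delivery["mail_from"])
print("RCPT TO:  ", ", ".join(delivery["rcpt_to"]))
print("Header From:", delivery["data"]["From"])
```

Python's own smtplib keeps the same separation: SMTP.sendmail() takes the envelope sender and recipients as arguments apart from the message text. Any scheme that checks only the From: header, or only the envelope sender, is looking at just one of the two.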