on the edge

computers & technology, books & writing, civilisation & society, cars & stuff


Greg Black

gjb at gbch dot net
Home page
Blog front page


If you’re not living life on the edge, you’re taking up too much space.


FQE30 at speed



Syndication / Categories

  All
   Announce
   Arts
   Books
   Cars
   Family
   House
   Meta
   People
   Places
   Random
   Society
   Software
   Technology
   Writing



Worthy organisations

Amnesty International Australia — global defenders of human rights

global defenders of human rights


Médecins Sans Frontières — help us save lives around the world

Médecins Sans Frontières - help us save lives around the world


Electronic Frontiers Australia — protecting and promoting on-line civil liberties in Australia

Electronic Frontiers Australia



Blogroll

(Coming soon…)



Software resources


GNU Emacs


blosxom


The FreeBSD Project

Wed, 30 May 2007

A new approach to spam filtering

About three years ago, I first considered DSPAM as a potential solution to the incoming tide of spam that was drowning me and that was increasingly overwhelming SpamAssassin, my then tool of choice. I wrote a couple of blog entries that discussed my research and included references to papers by Jonathan A. Zdziarski (the author of DSPAM) and Gordon Cormack (who, with Thomas Lynam, wrote an evaluation of anti-spam tools). I also mentioned some of my discussions with both Zdziarski and Cormack and said I would report more when I had more information.

Much time has passed and the spam problem has, as we all know, continued to get worse. Last December, having become completely fed up with the worsening performance of SpamAssassin, I decided to install DSPAM for testing. I elected not to bother training it, but allowed it to do its thing and contented myself with informing it of its errors. The downside was that I had to look at every incoming message, whether spam or not, to be sure of the classification. I have examined 82,931 messages in the last five months and I’m amazed at how well DSPAM works.

Overall, it has caught 98.92% of all spam and its false positive rate has been 0.02%. Most of the errors were in the first and second months while it was learning. Now, it is catching over 99.2% of spam with a false positive rate below 0.01% and there have been no false positives at all for a couple of months. For my wife, the learning was a little slower because she receives much less total email than me and her legitimate email volume is so small that it’s a bit of a challenge to get enough for training. However, even in her case, the detection rate is up to 98.90% and false positives have also disappeared.

I was going to modify qmail to reject messages that were deemed to be spam, but I’ve decided that it’s too much work, given the ickiness of the qmail code and the excellent performance of DSPAM. I also toyed with the idea of changing MTA, but I have not found an MTA that I would be willing to use that also has the ability to do what I want. I may one day decide to write my own MTA for in-house use, but for now I’m going to stick with qmail and the other modifications I had made to it in the past and—starting right now—I’m going to stop my practice of reviewing incoming spam in case any legitimate email is lurking there.

In other words, from now on, if anybody emails us and DSPAM thinks it’s spam, nobody will ever see the message. There will be no bounce, there will be no error message, there will be no sign that the message was lost. But it will be irretrievably lost. I have decided that the time spent on reviewing the spam is not worth the rewards, when the chances of finding a real message seem to be less than one in a million now. This is especially true when it’s also true that anybody who might need to contact us and who we would care to hear from has other methods of doing so.

I am really delighted to have got to the point where my spam load consists of hitting ‘S’ once a day to tell DSPAM about something it has missed.