[Evolution] Bogofilter server side, instead of SpamAssassin client side, but
doing learning client side with Evo's UI
Andrew Cowie
andrew@operationaldynamics.com
Thu, 05 May 2005 13:58:20 +1000
Hey,
Just want to describe an alternative (non-SpamAssassin) spam filtering
& training setup I've been using successfully with Evolution, in case
anyone is interested.
So I use bogofilter ( http://bogofilter.sourceforge.net/ ) as my spam
identifier. I've had good luck with it over the past couple years.
Indeed, I tried switching back to spamassassin once Evo 1.5 got serious
about the built-in hooks to fire up spamd and use spamc to talk to it,
but after a while found that I was really unhappy with spamassassin's
performance [at correctly identifying spam].
More importantly, I do my bogofilter-ing server side, inline in the
delivery pipe that my [ISP's] server uses (they run Qmail, and it's easy
to hook into the local delivery process).
So that left me with the problem of wanting to use Evolution's Junk / Not
Junk buttons to train my filter (with the resultant wordlist file
sitting on the client) but wanting to have that wordlist up on the
server for the bogofilter there to work off of.
The solution was threefold: override what Evo does to train, rsync the
word list to the server and do the actual scanning there, but do the
"check message headers and sort accordingly" step in an Evo client side
rule.
In order:
(1)
First, glancing at the code in em-junk-filter.c, I was able to figure
out what calls Evo is making when one presses the Junk or Not Junk
buttons. It composes a command line along the lines of
    sa-learn --spam --norebuild < MESSAGE_DATA
and
    sa-learn --ham --norebuild < MESSAGE_DATA
So what I did was override the sa-learn file [1]. Since I didn't want to
try and replace a system binary (whether or not spamassassin was
installed) [2], I wrote a tiny wrapper script and stuck it in ~/bin. The
wrapper intercepts the call to sa-learn, and instead calls bogofilter -s
or -n, as appropriate, to learn. I attached my script for anyone
interested [6].
Of course, to ensure that Evo sees my script instead of
/usr/bin/sa-learn, I need to invoke Evolution as
    PATH=~/bin:$PATH /usr/bin/evolution
Which isn't that big of a deal [3].
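A quick way to confirm the override is in effect is to ask the shell which sa-learn it would pick with the modified PATH. This is only a sketch: the function name resolve_sa_learn is invented for illustration and is not part of Evolution or bogofilter.

```shell
# Hypothetical helper (name invented for this example): print the sa-learn
# that a program started with ~/bin prepended to PATH would find first.
resolve_sa_learn() {
    (
        # The subshell keeps the PATH change from leaking into the caller.
        PATH="$HOME/bin:$PATH"
        command -v sa-learn
    )
}
```

If resolve_sa_learn prints a path under ~/bin, the wrapper should be the one picked up when Evolution is launched with the same PATH. If it prints /usr/bin/sa-learn or nothing at all, the wrapper is probably missing or not executable.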
(2)
I now have a growing, better trained ~/.bogofilter/wordlist.db on my
client machine.
But I want to do the actual scanning server side, because it means that
the CPU work of spam checking and preliminary sorting will be done ahead
of time, before I see the messages.
So I simply use rsync to push that file to the server. Nothing more
complicated than
    rsync --verbose \
          --recursive \
          -e /usr/bin/ssh \
          --partial \
          --progress \
          ~/.bogofilter afcowie@server.mycolo.com:/home/afcowie
On the server, my delivery instruction (a .qmail file) is along the
lines of
| /var/qmail/bin/preline /home/afcowie/bin/bogofilter -H -e -p \
| /home/afcowie/bin/maildrop
The -e -p to bogofilter passes messages through regardless (don't want
positives to be bounced right there, tempting as that may be, because we
want to be able to train false positives and false negatives on the
client in Evo with those terrific zippy Junk / Not Junk buttons!)...
... and maildrop (think procmail) has a really great little mail sorting
language, see http://www.courier-mta.org/maildropex.html . So server
side I do preliminary sorting of traffic to folders titled Clients,
Boards, and Lists (just so that if I *am* using webmail, I have a chance
in hell of seeing messages from my customers - also helps downstream
when composing rules for vFolders in Evo). Note that I *don't* railroad
a message marked with X-Spam-Status: Yes off to a ProbableSpam folder or
whatever because if, in Evo, I find a false positive or negative, I
want to be able to train it using Evo's wonderful UI.
(3)
New messages are fetched by Evolution's IMAP code across four folders.
In combination with NotZed's one liner "apply filters to all IMAP
folders" patch [4], I set up an incoming filter that looks for
X-Spam-Status: Yes and, if it finds it, does "Set Status" as "Junk" (puts
it in the Junk auto-vfolder) & "Set Status" as "Read" (so that it doesn't
clutter my unread counts). [5]
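For anyone wanting to see the same test outside Evolution, the header check that filter performs can be approximated in shell. This is a rough sketch, not what Evolution does internally: the function name is_flagged_spam is made up, and it assumes messages stored one per file with LF line endings and an unfolded X-Spam-Status line.

```shell
# Hypothetical helper: succeed if the message file named in $1 carries
# "X-Spam-Status: Yes" in its header block.
# Assumes LF line endings and an unfolded header line.
is_flagged_spam() {
    # sed prints up to the first blank line (end of headers) and then quits,
    # so a matching line quoted in the message body is not counted.
    sed '/^$/q' "$1" | grep -qi '^X-Spam-Status:[[:space:]]*Yes'
}
```

Something like `is_flagged_spam some-message && echo flagged` (with some-message standing in for a real file) is enough to spot-check what the server-side bogofilter decided before Evolution ever sees the message.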
And done!
If I get a wrongly classified message, I use the {Junk | Not Junk}
buttons. Evo moves the message {to Junk meta folder | back to the folder
it came from and should have been in}, and calls sa-learn (which I've
overridden to call bogofilter) to learn from the mistake.
And I periodically push bogofilter's wordlist up to the server via
rsync. [Note I'm not using autolearn server side, because then there
would be a two way sync problem, and there's no reason to, really.]
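The periodic push could be wrapped in a small helper that skips the transfer when nothing has been trained since the last run. This is a sketch under assumptions, not the script actually in use: the function name push_wordlist, the variable names, and the stamp file ~/.bogofilter-last-push are all invented here, and the host and paths are simply the ones from the rsync command earlier in this message.

```shell
# Hypothetical helper; every name below and the stamp file are assumptions.
BOGODIR="${BOGODIR:-$HOME/.bogofilter}"
DEST="${DEST:-afcowie@server.mycolo.com:/home/afcowie}"
STAMP="${STAMP:-$HOME/.bogofilter-last-push}"
RSYNC="${RSYNC:-rsync}"

push_wordlist() {
    # Skip the transfer when the word list is not newer than the stamp file
    # (an mtime comparison, so its resolution depends on the filesystem).
    if [ -e "$STAMP" ] && [ ! "$BOGODIR/wordlist.db" -nt "$STAMP" ]; then
        echo "wordlist unchanged, nothing to push"
        return 0
    fi
    # Same rsync options as above, minus the verbose/progress output.
    "$RSYNC" --recursive --partial -e /usr/bin/ssh "$BOGODIR" "$DEST" &&
        touch "$STAMP"
}
```

Calling push_wordlist after a training session, or from a scheduled job, keeps the one-way client-to-server sync described above without re-sending an unchanged database.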
And it all Just Works (tm)
AfC
Sydney
[1] This is all highly dependent on the exact form of the exec calls in
em-junk-filter.c . If those change, this will need to be tweaked.
[2] In fact, it turns out that the training code attempts to activate
spamd, and if it fails, bails out without doing any training. That's not
very good, because it means in my case I have to have SpamAssassin
installed, just so Evo can start it, just so I can ignore it and do
bayes training. However, I'd say more generally that firing up spamd
is unnecessary if all the user is doing is training (indeed, if they
don't have "filter incoming messages for junk" selected, then that
fire-up-spamd step should never need to happen), while still allowing
the training cycle to occur.
[3] But it sure would be nice if I could just tell evo what training
program to use. Devs aren't about to write that UI, I know.
[4] No problems with the filter on all folders thing so far!
[5] I know Jeff is going to be working on the IMAP code again sometime
soon. It seems like under POP the messages get passed to the filters
before they show up as unread in a folder; in my IMAP case, I get a blob
of unread messages in INBOX, then half a second later they vanish as they
get Junk classified. Not sure if that's fixable.
[6] Hey, so I just attached a little shell script as an example, but
it's showing up as MIME type application/x-shellscript. I certainly
wouldn't want anyone's client to try and just *run* this script (it's not
like it's a photo which needs a viewer) - I want to deliver it as
text/plain so people can glance at it if they want to. How do I do that?
Hm. Anyway, to work around that and achieve text/plain, I stripped the
#!/bin/sh line.
-- 
Andrew Frederick Cowie
OPERATIONAL DYNAMICS
Operations Consultants and Infrastructure Engineers
http://www.operationaldynamics.com/
Attachment: sa-learn (text/plain)

#
# author: Andrew Frederick Cowie <andrew@operationaldynamics.com>
# use only as directed, see your doctor if symptoms persist.
#
# Place as ~/bin/sa-learn, (not a good idea to replace /usr/bin/sa-learn)
# Make sure you call Evolution with PATH=~/bin:$PATH /usr/bin/evolution
# (ie, even if its in your shell path, if you have a launcher or key shortcut
# you need to add the PATH thing above. My launchers look like
#
#      sh -c 'PATH=~/bin:$PATH LANG=en_CA LC_TIME=en_GB /usr/bin/evolution'
#

echo "AfC: in sa-learn -> bogofilter wrapper" 1>&2

for option in $*
do
	case "$option" in
	--spam)
		echo "AfC: learn as spam" 1>&2
		MODE="-s"
		;;
	--ham)
		echo "AfC: learn as ham" 1>&2
		MODE="-n"
		;;
	--rebuild)
		echo "AfC: rebuild called - will exit." 1>&2
		exit 0
		;;
	*)
		echo "AfC: ignoring option $option" 1>&2
		;;
	esac
done

echo "AfC: executing bogofilter -v -l $MODE " 1>&2
exec bogofilter -v -l $MODE