Splonk (Version 1.1.1)
======================

Copyright (c) 2001 Gary W. Renshaw <gary@spots.ab.ca>

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA


What it does
============

Splonk is a system for eliminating spam from your E-mail.  As it is
intended to Gaul spammers, it consists of three parts:

1) The .procmailrc and rc.splonk files that implement a small set of
generic spam filters that so far have eliminated ALL of my spam.

2) A trivial program called suck that accepts an infinite amount of
input from stdin as well as any number of arbitrary command line
arguments.  It is used as a stand-in program for testing.

3) The mail-bounce program.  This allows custom bounces of any mail
message to the original sender.  Mail-bounce is available separately
from http://www.spots.ab.ca/~gary/mail-bounce/.

Splonk is very easy to configure and install.


The Philosophy
==============

It will never be possible to eliminate all spam by rejecting certain
patterns because spammers can always think up a new pattern that isn't
covered.

Splonk first ACCEPTS certain items that you always want (such as
mailing list messages), then REJECTS certain people you never want
mail from (like your ex-).

After dealing with these individuals it looks for some common spam
patterns.  Not all spam may be rejected by this section but it does
keep the noise down.

Any mail that gets through all this is tested to see if it is actually
for YOU, either by direct addressing or by coming from someone you
know who hides the recipient list.

Any mail that is still remaining must be: (a) a new mailing list you
forgot to include; (b) someone who hides the recipient list you forgot
to include; (c) new spam.  In any case the mail falls into a mail
folder called "unknown".  You can review the contents by looking at
the subject lines in the splonk log so you never even have to look at
the rejected mail unless you want to.
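
The order of these checks matters more than any single pattern.  A
minimal Python sketch of that order, assuming each stage is a function
that takes a message and returns True or False (the name classify and
all of the stage names are made up for this illustration and do not
come from rc.splonk):

```python
def classify(msg, wanted, unwanted, looks_like_spam, is_for_me):
    """Return the fate of msg using the order described above.

    Every argument after msg is a function taking msg and returning
    True or False.  All names here are illustrative.
    """
    if wanted(msg):           # mailing lists and other mail you always want
        return "deliver"
    if unwanted(msg):         # senders you never want to hear from
        return "reject"
    if looks_like_spam(msg):  # generic spam patterns
        return "reject"
    if is_for_me(msg):        # addressed to you, or from a known sender
        return "deliver"
    return "unknown"          # nothing matched: set aside for review


# A toy message is just a dict of header fields here.
msg = {"from": "stranger@example.com", "to": "undisclosed-recipients"}
fate = classify(
    msg,
    wanted=lambda m: False,
    unwanted=lambda m: False,
    looks_like_spam=lambda m: False,
    is_for_me=lambda m: m["to"] == "you@example.com",
)
print(fate)
```

The toy message matches no stage, so it ends up as "unknown" rather
than being delivered.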


Quick Installation
==================

Note: This documentation refers to version 1.1.1.  The tarball you
have may be for a different version.  Just change the version number
in the instructions to match what you have.

	tar zxvf splonk-1.1.1.tar.gz
	cd splonk-1.1.1
	make

That's it.  

If you want to keep your configuration answers handy for later
updating you can edit the ./answers file to your liking.  Comment
lines are ignored in this file, but blank lines are significant: each
one signals the end of a section.


Questions
=========

Q: What is this mail-bounce thing mentioned in the recipes?

A: It is a Perl program that generates a customised bounce message
from the spam and sends it back to the spammer.  See
http://www.spots.ab.ca/~gary/mail-bounce/ for details.

Q: Why bother with spam patterns when the recipe file rejects all
mail not addressed to specific users?

A: At some point spam is actually addressed to you personally,
otherwise it wouldn't have arrived in your mail box.  If you rely only
on rejecting mail that is not addressed to you, that spam will
eventually get through.  By including spam patterns we eliminate the
majority of spam BEFORE we even check if it is addressed to you.


Test Facilities
===============

See the sections below for more details.

1) You can enable a test recipe in rc.splonk that rejects any message
with a special Subject line (Subject: procmail test).

2) You can switch from bouncing rejects to stuffing them into a special
mail folder so you can test if a specific mail gets caught.

3) You can turn on the procmail debugging (VERBOSE=yes in .procmailrc)
to see what is happening with individual recipes.

4) You can run the test.pl script to take a file full of spam and run
it through Splonk.  Normally I do this with my junk mail file as a
regression test to make sure I'm still catching all the spam after
changing rc.splonk.

To run test.pl just re-direct its stdin to use your junk mail file.
Something like:

	./test.pl < ~/Mail/junk


The .procmailrc file
====================

Splonk creates a .procmailrc file under your home directory that points
to the directory ~/splonk.  It also contains two variables that may be
changed for testing:

BOUNCE normally points to the mail-bounce program.  For testing it
points to suck which just acts as a data sink at the end of the pipe
and also doesn't mind eating mail-bounce's command line arguments.
Think of it as the program version of /dev/null.

JUNK normally points to the mail file "junk" that will contain
everything that is rejected by Splonk as insurance against a mistake
being made.  For testing it points to the mail file "rubbish" (or
whatever you wish).  During testing the rejects go here instead which
is why there need to be two mail files.


RC.SPLONK SECTIONS
==================


Section 1: Initial Testing
==========================

This section in rc.splonk has one recipe that is normally commented
out.  It is used solely for testing your mail set up and bounces any
message with the exact subject line "Subject: procmail test".

Unless you want to test your set up you should leave this recipe
commented out.


Section 2: Fixing broken lines
==============================

Recipes have trouble finding the word "remove" in a message if it is
broken between two lines like this: re=
move.

The = at the end of the lines is a "feature" of some mailers: it is
the soft line break of the quoted-printable encoding.  This recipe
detects messages with such lines and calls a Perl script to eliminate
the line breaks.  It may knacker the formatting for some E-mail
slightly but that's the price one pays for remaining spam free.

After Perl has savaged the lines we call procmail again to complete
the processing of the message.

You should never have to modify this recipe.
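
The repair itself is small.  The script that ships with Splonk is
written in Perl and is not reproduced here; the following Python
sketch only shows the idea, assuming that every line ending in = is
simply joined to the line after it (the function name is made up for
this example):

```python
import re


def join_soft_breaks(text):
    """Delete each "=" that ends a line, together with the line break."""
    return re.sub(r"=\r?\n", "", text)


broken = "Click here to be re=\nmoved from our list.\n"
print(join_soft_breaks(broken))
```

An = in the middle of a line is left alone.  A line that legitimately
ends in = is joined as well, which is the formatting damage mentioned
above.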


Section 3: Get out of my life
=============================

There are some people you never want to hear from again.  This section
contains recipes that bounce any mail from those people (or
organisations).  Add as many recipes of this pattern as you like,
changing the name on each one.


Section 4: Mailing lists
========================

Some of the common spam phrases are also found in common mailing list
messages such as "to be removed", so we need to process legitimate
mailing list messages before dealing with the spam.

Repeat the recipe as many times as necessary, once for each mailing
list to which you have subscribed, substituting the names of the
mailing lists in each one.


Section 5: Header spam
======================

Some spam can be caught in the message headers.  For example, some
bulk mailing programs advertise themselves in the headers.  Other
spams are easily caught because they have obvious subject matter such
as the famous "Make Money Fast".

The trick is not to get too specific with subjects.  For example, a
recipe to get rid of "Make Money Fast" might not catch "Make Money
Quickly" or "Make $$$ Fast" or other variations.  Try to make each
recipe work hard for its living and catch many different kinds of
spam.

The bulk mailer recipe looks for the words "bulk" and "mail" in any
header so that it will catch "bulk mail" or "bulk mailer" or "bulk
mailing" with any amount of text in between the words.

It is equally important not to choose a recipe that will throw out
valid E-mail.  No real message should have "bulk...mail" anywhere in
its headers.
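
A pattern like this is easy to try out before it goes into a recipe.
The Python sketch below is one way to do that, assuming the headers
are available as a block of text; the sample headers and the function
name are invented for the example:

```python
import re

# Case-insensitive, like procmail's default matching.  In Python "."
# does not match a line break, and each header line is checked on its
# own, so both words must appear on the same line.
BULK = re.compile(r"bulk.*mail", re.IGNORECASE)


def has_bulk_header(headers):
    """True if any single header line contains "bulk" ... "mail"."""
    return any(BULK.search(line) for line in headers.splitlines())


spam = "X-Mailer: Super Bulk E-Mailer 3000\nSubject: hello\n"
ham = "Subject: bulk order\nX-Mailer: Some Mail Client\n"
print(has_bulk_header(spam), has_bulk_header(ham))
```

The second sample has "bulk" and "Mail" on different header lines and
is not flagged.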


Section 6: Body spam
====================

These recipes catch spam phrases in the body of the message.  There
are several ways spammers try to get around phrase recognition.

The first is the trailing equals sign that splits words.  We took care
of that above.

Another is adding spaces between the letters of words such as "to be r
e m o v e d" so the word can't be recognised.  Fortunately the regular
expression ".?" matches one optional character so you can match a word
by using something like "r.?e.?m.?o.?v.?e"
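
A quick check of that pattern in Python, as a sketch only (the sample
phrases are made up, and Python's regular expressions are not
identical to procmail's, although ".?" means the same in both):

```python
import re

SPACED_REMOVE = re.compile(r"r.?e.?m.?o.?v.?e", re.IGNORECASE)

for sample in ("to be removed", "to be r e m o v e d", "to be r-e-m-o-v-e-d"):
    print(sample, "->", bool(SPACED_REMOVE.search(sample)))
```

All three samples match.  Each ".?" absorbs at most one character, so
a spammer who puts two spaces between the letters slips past this
particular pattern.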

The last common trick is to vary the wording slightly.  For example:

If you wish to receive no further mailings, reply ...
If you want to receive no further mailings, reply ...
If you want to receive no further advertisements, respond ...
If you don't want to receive further mailings, respond ...

You get the idea.  Fortunately, while English does not have the most
regular grammar in the world it does have a considerable degree of
regularity and that makes it possible to write regular expressions
that will catch a wide variety of different wordings for the same
meaning.

The following regular expression will catch all of these variations.

* you (don't )?(wish|want) to receive.+(reply|respond).+remove

Looking at several spams with this kind of phrasing we notice that the
rest of the sentence eventually uses the word remove (from the list,
from our database, etc.) so we add that to the end.
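
A pattern of this kind can be checked against sample sentences before
it is trusted.  The Python sketch below does that.  Note that the
optional group is written "(don't )?" with the space inside it, so
that sentences with and without "don't" both match.  The endings of
the sample sentences are invented, since the examples above stop at
"reply ...":

```python
import re

# "(don't )?" carries its own trailing space, so exactly one space
# separates "you" from "wish" or "want" when "don't" is absent.
OPT_OUT = re.compile(
    r"you (don't )?(wish|want) to receive.+(reply|respond).+remove",
    re.IGNORECASE,
)

# Hypothetical endings; only the first half of each sentence comes
# from the examples above.
samples = [
    "If you wish to receive no further mailings, reply with remove in the subject.",
    "If you want to receive no further advertisements, respond and we will remove you.",
    "If you don't want to receive further mailings, respond with the word remove.",
]
print(all(OPT_OUT.search(s) for s in samples))
```

A sentence that never mentions "remove" does not match, which is the
point of adding that word to the end of the pattern.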


Section 7: Foreign spam
=======================

Spam is not just an English problem.  These recipes try to identify a
foreign language and bounce the message.  The assumption is that if
the message is in that language I just don't want to receive it at
all.  No phrase analysis is done so once the language is identified
the message is dumped.

How do we identify the language?  Simply by looking for some common
words.  For example, it is almost impossible to write a German E-mail
without using words like und, die/der/das, ist, sie, etc.  If we find
a bunch of them in the message we can assume the message is in German.

Don't be content to match just one word; you might eliminate a valid
E-mail (notice that the German die is also an English word).  The
recipe in this section matches four words.
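
The idea can be sketched in a few lines of Python.  This assumes that
all four words must be present, each as a whole word; the four words
chosen here are only stand-ins for whatever the recipe really uses,
and the function name is made up:

```python
import re

# Assumption: these four words stand in for the ones in the real recipe.
GERMAN_WORDS = ("und", "ist", "der", "das")


def looks_german(body, words=GERMAN_WORDS):
    """True only if every word in `words` appears as a whole word."""
    return all(
        re.search(r"\b" + re.escape(w) + r"\b", body, re.IGNORECASE)
        for w in words
    )


print(looks_german("Das Wetter ist gut und der Himmel ist blau."))
print(looks_german("I would rather die than read another advert."))
```

The first sentence contains all four words and is flagged; the English
sentence contains none of them and is not.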

I'll be adding more languages over time.


Section 8: Weird E-mail
=======================

E-mail messages should always have a group of headers followed by the
message.  If there are headers in the body then something is wrong.
These patterns detect that kind of weirdly constructed E-mail.
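
As a rough illustration in Python: split the message at the first
blank line and look for header lines in what follows.  The header
names below are assumptions, not the list used in rc.splonk; From:
and Subject: are left out on purpose because forwarded mail often
quotes them in the body.

```python
import re

# Assumption: example header names only.
HEADER_IN_BODY = re.compile(
    r"^(Received|Message-ID|Return-Path):",
    re.IGNORECASE | re.MULTILINE,
)


def has_headers_in_body(raw):
    """Split at the first blank line and look for header lines after it."""
    _, _, body = raw.partition("\n\n")
    return bool(HEADER_IN_BODY.search(body))


normal = "From: a@example.com\nSubject: hi\n\nSee you at noon.\n"
weird = "From: a@example.com\n\nReceived: from nowhere\nSubject: buy now\n\nBuy!\n"
print(has_headers_in_body(normal), has_headers_in_body(weird))
```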


Section 9: Virus protection
===========================

Why in heaven's name you would ever want or need to execute a file
full of sound samples I have no idea, but obviously Microsoft thought
it was a good idea.  Just about any Microsoft-specific file format can
be executed and that means it can contain a virus.  What a great idea!

Even though .doc and .xls files can also contain viral code they are
left out of this bounce list because they are frequently sent for
legitimate reasons.  If you insist that people send you documents only
as text or RTF you can add doc and xls to this recipe to bounce them
too.
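
The check amounts to looking at attachment file names.  A Python
sketch using the standard email module follows.  The set of blocked
extensions is purely hypothetical (this document does not list the
ones the recipe uses), and the function name is invented:

```python
import email
import os

# Assumption: an example list only.  Add ".doc" and ".xls" here if you
# want those bounced too.
BLOCKED = {".exe", ".vbs", ".scr", ".pif"}


def has_blocked_attachment(raw, blocked=BLOCKED):
    """True if any MIME part has a file name with a blocked extension."""
    msg = email.message_from_string(raw)
    for part in msg.walk():
        name = part.get_filename()
        if name and os.path.splitext(name)[1].lower() in blocked:
            return True
    return False


raw = (
    "From: a@example.com\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XX"\n'
    "\n"
    "--XX\n"
    "Content-Type: text/plain\n"
    "\n"
    "hello\n"
    "--XX\n"
    "Content-Type: application/octet-stream\n"
    'Content-Disposition: attachment; filename="party.EXE"\n'
    "\n"
    "AAAA\n"
    "--XX--\n"
)
print(has_blocked_attachment(raw))
```

The extension is compared in lower case, so "party.EXE" is caught.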


Section 10:  Legitimate Users
=============================

At last, we actually allow real users to get their mail.  All you have
to do is provide a recipe for each legitimate E-mail recipient that
you will be processing and route that mail to the $SPOOL variable set
up in the .procmailrc mentioned above.


Section 11: Hidden recipients
=============================

Everyone knows a few paranoid types who hide the recipients of their
messages.  In this section we provide a simple recipe for each such
person you know.

What we are saying in this section is "accept all mail from person X
regardless of who it is for".  Notice that these people will have to
pass the spam content filters first, so viruses mailed by your
friend's computer will still be caught.


Section 12:  Unknown
====================

If a message makes it this far there are two possibilities.  One is
that it is spam that wasn't caught by the recipes in sections 5
through 9.  The other is that either you subscribed to a mailing list
or have a new paranoid friend and didn't enter them in the recipes.

In either case the mail is put in a file called "unknown" in your mail
directory so you can view it.  You can also see the subject headers in
the splonk/log file.
