SpamAssassin

From HerzbubeWiki

Overview

Debian packages:

  • spamassassin
  • spamc

spamd is the daemon form of SpamAssassin. By default it listens on 127.0.0.1, port 783.

spamc is the client for spamd.


References

Daemon overview:

/usr/share/doc/spamassassin/README.spamd.gz

Introduction to the actual SpamAssassin program:

man spamassassin


Security

As long as no user-specific configuration is required, the daemon does not have to run as root. It can be started using the option -u <username>, e.g. with user spamassassin that was created manually.

As soon as a user-specific configuration is required (e.g. because users want to train their individual Bayes database), the daemon must run as root. Reason: the daemon needs to change uid so that it can read/write the user-specific files.

If user-specific configuration is maintained via MySQL or LDAP, the daemon does not need to run as root.


Configuration

spamd

The file /etc/default/spamassassin contains some essential configuration options for the spamd daemon:

  • ENABLED=1 makes sure that spamd is launched by its init script, e.g. when the system boots.
  • The OPTIONS string contains command line options for spamd:
    • --create-prefs = create a user's preferences file if it does not exist yet
    • --max-children 5 = the daemon should run at most 5 child processes at a time
    • --helper-home-dir = external programs that spamd launches (e.g. Razor) should get the HOME environment variable set to the value of this option; if the option has no value, the value of HOME is taken from the spamc process that contacted the daemon
  • CRON=1 enables SpamAssassin to automatically update its rules on a daily basis
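
Put together, the options listed above might result in a file like the following. This is only a sketch assembled from the values mentioned in the list, not a copy of an actual file; the file is read as a shell fragment, so plain variable assignments are all it needs.

```shell
# /etc/default/spamassassin (sketch; values taken from the list above)

# Let the init script launch spamd, e.g. when the system boots
ENABLED=1

# Command line options that are passed to spamd
OPTIONS="--create-prefs --max-children 5 --helper-home-dir"

# Update the SpamAssassin rules automatically once a day
CRON=1
```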


SpamAssassin

The source for the information in this chapter is man spamassassin, chapter "Configuration Files".

  • SpamAssassin first loads its default configuration (i.e. the factory settings) from one of several pre-defined locations. On my Debian box the directory is
/usr/share/spamassassin
  • Next it uses site-specific configuration data to override previously set values. Again, a number of pre-defined locations are tried, but on my machine the configuration will be taken from
/etc/spamassassin
  • Finally, individual user preferences are loaded that will again override any previously set values. The preferences are taken from the location specified on the command line, or from the following file if nothing is specified on the command line
~/.spamassassin/user_prefs


Default configuration

The base configuration is located in

/usr/share/spamassassin

The directory contains a number of files that are loaded in a pre-defined order (see man page).


Site-specific configuration

Site-specific configuration is taken from

/etc/spamassassin

Notes:

  • First all files ending in .pre are read in lexical order.
  • Next all files ending in .cf are read, again in lexical order
  • The convention seems to be to load plugins in .pre files, and to set configuration options in .cf files
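
To see in which order the files of the directory would be read, the rule above can be approximated with a small shell function. This is a sketch: the function name list_site_config is made up for illustration, and the sort order of shell globs follows the current locale, which may differ in details from SpamAssassin's own sorting.

```shell
# List the files of a configuration directory in the order described above:
# first all *.pre files, then all *.cf files. Shell globs expand in sorted
# order, which stands in for the "lexical order" here.
list_site_config() {
    dir=${1:-/etc/spamassassin}
    for f in "$dir"/*.pre "$dir"/*.cf; do
        # An unmatched glob stays literal, so skip names that do not exist
        if [ -f "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}
```

Calling list_site_config without an argument lists /etc/spamassassin.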

I have made the following changes

  • local.pre
# Auto-whitelist (AWL) has been disabled by default since 3.3.0, so we enable it explicitly
loadplugin Mail::SpamAssassin::Plugin::AWL
  • local.cf
    • None


Besides my own changes, the following important/interesting defaults apply (for details see man spamassassin, chapter "TAGGING"):

  • Any existing headers beginning with "X-Spam-" are removed to prevent spammer mischief and also to avoid potential problems caused by prior invocations of SpamAssassin.
  • Messages with score 5.0 or higher are classified as spam
  • If an incoming message is classified as spam, instead of modifying the original message, SpamAssassin will create a new report message and attach the original message as a message/rfc822 MIME part (ensuring the original message is completely preserved and easier to recover)
  • The new report message inherits some headers (if they are present) from the original spam message (e.g. From:, To:)
  • A spam message gets this header: X-Spam-Flag: YES
    • Although this is not documented, I had one genuine ham message that received the header X-Spam-Flag: No
  • All messages (regardless of whether they are classified as spam or ham) get these headers:
    • X-Spam-Checker-Version:
    • X-Spam-Level:
    • X-Spam-Status:


Individual user preferences

Individual user preferences are loaded from the location specified on the spamassassin, sa-learn, or spamd command line (see respective manual page for details). If the location is not specified, ~/.spamassassin/user_prefs is used if it exists. SpamAssassin will create that file if it does not already exist, using /usr/share/spamassassin/user_prefs.template as a template. The regular template file on my Debian system contains only comment lines.

The ~/.spamassassin directory contains files with the following user-specific information:

  • The auto-whitelist (aka automatic whitelist or AWL) database: a list that tracks scores for regular correspondents, i.e. the scores of all messages that someone has sent in the past are averaged, and the scores of any messages in the future are pushed towards that average
  • The Bayes database: is updated when messages are auto-learned as spam, or when explicit training with sa-learn occurs


Integration with Exim

Overview

Regular spam checking is done inside an ACL. There is also an alternative that employs a special router and transport, but this is more complicated to understand and not as efficient.


ACL approach

The ACL that is being referenced by acl_smtp_data (i.e. the ACL that is executed after the DATA command and its data have been received during the SMTP dialog) may be extended with the ACL condition spam. For instance:

warn
  spam = <username>

The condition is available only when Exim has been compiled with content-scanning support (originally the so-called exiscan patch). In Debian, you have to use the package exim4-daemon-heavy to get it.

Details about the operations performed by the spam condition can be looked up in the Exim docs in chapter 40.2 ("Scanning with SpamAssassin").

Essentially the condition contacts the spamd daemon directly (spamc is not involved at all) and hands it the message to scan. When the condition "returns", the message is still in its original form, i.e. SpamAssassin did not add any headers. Instead, the spam condition sets up a number of expansion variables that can be used to add the spam headers to the message inside the ACL. The condition returns true if SpamAssassin has classified the message as spam.

The following expansion variables are set up:

  • $spam_score: The spam score of the message, for example “3.4” or “30.5”. This is useful for inclusion in log or reject messages.
  • $spam_score_int: The spam score of the message, multiplied by ten, as an integer value. For example “34” or “305”. This is useful for numeric comparisons in conditions. This variable is special; it is saved with the message, and written to Exim's spool file. This means that it can be used during the whole life of the message on your Exim system, in particular, in routers or transports during the later delivery phase.
  • $spam_bar: A string consisting of a number of “+” or “-” characters, representing the integer part of the spam score value. A spam score of 4.4 would have a $spam_bar value of “++++”. This is useful for inclusion in warning headers, since MUAs can match on such strings.
  • $spam_report: A multiline text table, containing the full SpamAssassin report for the message. Useful for inclusion in headers or reject messages.
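
How $spam_score_int and $spam_bar relate to the score can be illustrated outside of Exim. The two functions below are a sketch that mimics the descriptions above, not Exim's actual code. The function names are made up, rounding to the nearest integer is an assumption, and the sketch simply prints an empty bar when the integer part of the score is zero, because the description above does not cover that case.

```shell
# Score multiplied by ten, as an integer (e.g. 3.4 -> 34).
# Rounding to the nearest integer is an assumption of this sketch.
spam_score_int() {
    awk -v s="$1" 'BEGIN { printf "%d\n", int(s * 10 + (s < 0 ? -0.5 : 0.5)) }'
}

# One "+" per whole point of a positive score, one "-" per whole point of a
# negative score (e.g. 4.4 -> "++++").
spam_bar() {
    awk -v s="$1" 'BEGIN {
        n = int(s); c = "+"
        if (n < 0) { n = -n; c = "-" }
        bar = ""
        for (i = 0; i < n; i++) bar = bar c
        print bar
    }'
}
```

For example, spam_score_int 30.5 prints 305 and spam_bar 4.4 prints ++++.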


Usually the expansion variables will be used to fill some mail headers. An example that tries to "simulate" the headers added by a default SpamAssassin configuration might look like this:

# Perform classification
warn
  set acl_m9  = ham
  spam        = $acl_m8
  set acl_m9  = spam

# Add "X-Spam-Flag:" header only if message was spam
warn
  condition  = ${if eq {$acl_m9}{spam} {true}{false}}
  add_header = X-Spam-Flag: YES

# Add additional headers regardless of message classification.
# add_header can be given several times in one statement.
warn
  add_header = X-Spam-Checker-Version: ???
  add_header = X-Spam-Level: $spam_score ($spam_bar)
  add_header = X-Spam-Status: $spam_report

How does the $acl_m8 variable get its value?

  • some ACL (probably the one executed during RCPT) performs recipient verification using verify = recipient
  • this results in the execution of the verification router router_localuser_verify (discussed in the "Virtual domains" solution on the Exim page)
  • that verification router sets address_data = $local_part
  • after execution of the verification router has finished, the ACL transfers the value of $address_data into $acl_m8 because $address_data loses its value after the verification process ends


Router/Transport approach

Overview

The solution using a router/transport combination basically works in the following way:

  1. a message is received by the Exim MTA
  2. the Exim MTA passes the message to a special transport that invokes the pipe spamc | exim4 and passes the message to stdin of spamc
  3. spamc contacts spamd on 127.0.0.1, port 783, and passes the message over the network connection
  4. spamd invokes SpamAssassin which classifies the message and adds mail headers with the classification results to the message
  5. spamd passes the modified message back to spamc over the network connection
  6. spamc prints the modified message to stdout, which is then passed to exim4 over the pipe
  7. exim4 re-submits the modified message to the Exim MTA, using a special protocol with the custom name spam-scanned
  8. the Exim MTA performs final routing/transport for the re-submitted message (re-submitted messages are distinguished from originally received ones by the spam-scanned protocol)


Router

router_localuser_spamcheck:
  driver = accept
  transport = transport_localuser_spamcheck
  no_verify
  condition = "${if and { {!eq {$received_protocol}{spam-scanned}} {!def:acl_m7} } {1}{0}}"
  domains = localhost

The router becomes active only for messages whose destination is localhost. This restriction can be set only because the router works together with the "virtual domains" solution presented on the Exim page:

  • the router router_virtualdomains transforms all addresses for virtual domains into local addresses
  • normally local addresses would be processed by router_localuser, but now router_localuser_spamcheck catches these addresses first
  • the process described in the "Overview" chapter above occurs
  • the re-submitted message then gets processed by router_localuser

In addition to the domains = localhost restriction, the following conditions apply:

  • {!eq {$received_protocol}{spam-scanned}} = to prevent an infinite loop where the same message is spam-checked again and again
  • {!def:acl_m7} = to skip messages for which an ACL has disabled spam checking by setting $acl_m7 (e.g. because the message was locally submitted)
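
The logic of the router condition can be restated as a tiny shell function. This is only an illustration of the truth table, not something Exim runs; the function name should_spamcheck is made up, and an empty second argument stands for "$acl_m7 is not defined".

```shell
# Mirror of the router condition: succeed only if the message has not been
# spam-scanned yet AND no ACL has set acl_m7.
#   $1 = value of $received_protocol
#   $2 = value of $acl_m7 (empty string stands for "not defined")
should_spamcheck() {
    [ "$1" != "spam-scanned" ] && [ -z "$2" ]
}
```

A freshly received message (should_spamcheck esmtp "") passes the check, while a re-submitted one (should_spamcheck spam-scanned "") does not, which is what breaks the loop.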


Transport

transport_localuser_spamcheck:
  driver = pipe
  command = /usr/sbin/exim4 -oMr spam-scanned -bS
  transport_filter = /usr/bin/spamc
  use_bsmtp = true
  home_directory = "/tmp"
  current_directory = "/tmp"
  return_path_add = false
  log_output = true
  return_fail_output = true
  message_prefix =
  message_suffix =

The driver is pipe because the message should be passed on to an external program via a pipe.

The message is first passed to the external program defined in transport_filter, which in the present case is the spamc client. spamc is executed under the uid/gid that was defined by the router, i.e. a local user uid/gid. This leads to spamd using the configuration in that user's home directory (esp. the user's Bayes database).

The message that spamc returns is then passed to the external program defined in command. This leads to the message being re-submitted:

  • -bS tells Exim to expect the message as "batched SMTP input"; see man exim4 for details
    • the transport option use_bsmtp tells the transport to write the message in "batch SMTP" format
  • -oMr spam-scanned tells Exim to set the $received_protocol variable to "spam-scanned" while the re-submitted message is processed
    • Exim only allows $received_protocol to be set by "privileged" users
    • therefore the exim4 process that processes the re-submitted message must run under a "privileged" uid/gid
    • according to the Exim docs the following users are privileged:
      • root
      • the user Debian-exim (or rather: the user that was defined in EXIM_USER when the package was compiled)
      • users that are listed in the main configuration option trusted_users
      • users that belong to groups that are listed in the main configuration option trusted_groups
    • in order for the presented transport to work, all local users must therefore be placed in the trusted_users list in the main configuration

return_fail_output is true so that if command returns with a non-zero exit code, the command's output is added to the bounce message.

return_path_add is set to false because the header will be added later when final delivery is made.

The message_prefix and message_suffix options are explicitly unset in order to leave the message unchanged when it is re-submitted.


Delivery into junk mail folder

What the above chapters (esp. the transport description) do not show is what happens to the delivered message. Typically there will be some filter that moves spam into a special junk folder. Such a filter can be configured in a mail client; in my setup, however, the filter is part of my ~/.forward file. For instance:

# Most of the time ham mail does not contain the X-Spam-Flag header.
# However, I found one case where it was present and had the value
# "No". Spam mail, on the other hand, always has the header with
# the value "YES". The test for "Yes" is just to be on the safe side
# in case SpamAssassin devs one day decide to regularize the case of
# the header's value.
if "${if def:h_X-Spam-Flag {def}{undef}}" is "def" and
   ($h_X-Spam-Flag is "YES" or $h_X-Spam-Flag is "Yes") then

   save Maildir/.Junk/
endif


Training the Bayes database

Automatic training

By default SpamAssassin automatically trains the Bayes database with messages that it is currently classifying. On my system, the file /usr/share/spamassassin/10_default_prefs.cf sets the option bayes_auto_learn to 1.

Not all messages are automatically trained upon, though. The man page for Mail::SpamAssassin::Plugin::AutoLearnThreshold provides details:

  • a message is automatically trained only if either
    • its score is below a certain threshold (in this case it is trained as ham), or
    • its score is above a certain threshold (in this case it is trained as spam)
  • the score that is compared to the lower/upper thresholds does not contain certain tests
  • in order to train a message as spam, the message must score at least 3 points in the header and at least 3 points in the body

The defaults for the thresholds are 0.1 (lower) and 12 (upper).
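
The rules above can be condensed into a small decision function. This is a sketch of the rules as summarized here, not the plugin's code. The function name autolearn_decision is made up, the score passed in is assumed to be the one already stripped of the excluded tests, and whether a score exactly on a threshold counts is an assumption.

```shell
# Decide whether a message would be auto-learned.
#   $1 = score used for auto-learning
#   $2 = points scored by header tests
#   $3 = points scored by body tests
#   $4 = lower threshold (default 0.1), $5 = upper threshold (default 12)
# Prints "ham", "spam" or "no".
autolearn_decision() {
    awk -v s="$1" -v h="$2" -v b="$3" -v lo="${4:-0.1}" -v hi="${5:-12}" 'BEGIN {
        if (s < lo)
            print "ham"
        else if (s >= hi && h >= 3 && b >= 3)   # boundary handling assumed
            print "spam"
        else
            print "no"
    }'
}
```

A message with score 15 that collects only 1 header point is therefore not learned as spam, although it is far above the upper threshold.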


Manual training

Manual training is done by the sa-learn program. See the man page for details.

Basic invocation:

sa-learn --spam /path/to/mail/folder
sa-learn --ham /path/to/mail/folder

To train a single message:

cat messagefile | sa-learn --ham


Notes about the training process:

  • re-training a message as ham that was previously trained as spam (and vice versa) makes SpamAssassin forget about the previous training
  • 1000 messages each for ham and spam is the minimum required for successful training; training fewer messages works, too, but the results are not satisfying
  • training more than 5000 messages does not improve the quality of results much
  • training should be done with current spam messages since spam continually changes its appearance
  • the age of ham and spam messages being trained should be about the same; training old spam but new ham leads to messages with an old timestamp being classified as spam, even if they are ham
  • the quantity of ham messages being trained should be larger than the quantity of spam messages; if it is smaller, results may not be satisfying (esp. if only a few ham messages are trained)
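
To keep an eye on the quantities mentioned in the notes, the number of messages learned so far can be read from the output of sa-learn --dump magic. The helper below is a sketch; the function name bayes_counts is made up, and it assumes that the count is the third column of the lines ending in "non-token data: nspam" and "non-token data: nham".

```shell
# Read the output of "sa-learn --dump magic" on stdin and print the number of
# learned spam and ham messages.
# Assumption: the count is in the third column of the matching lines.
bayes_counts() {
    awk '/non-token data: nspam$/ { spam = $3 }
         /non-token data: nham$/  { ham = $3 }
         END { printf "spam=%d ham=%d\n", spam, ham }'
}
```

Usage: sa-learn --dump magic | bayes_counts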


cron

Users are given the opportunity to train their individual spam filter. A periodic cron job that runs in each user's context scans two pre-defined IMAP folders and learns the messages in them as ham and spam, respectively. The folders are

~/Maildir/.Junk.Training-ham
~/Maildir/.Junk.Training-spam

The user may place any messages she wants into these folders. Usually such messages will be false-positives and/or false-negatives.

The script looks like this:

#!/bin/bash

# ------------------------------------------------------------
# Arguments
#  None
#
# Exit codes
#  0 = ok
#  1 = this program is already running for the current user
#  2 = a prerequisite could not be found
#  3 = some error related to temporary files occurred
# ------------------------------------------------------------


# ------------------------------------------------------------
# Initialize variables

# Maildirs
MAIL_DIR="Maildir"
TRAINING_HAM_DIR=".Junk.Training-ham"
TRAINING_SPAM_DIR=".Junk.Training-spam"
TRAINED_AS_HAM_DIR=".Junk.Trained-as-ham"
TRAINED_AS_SPAM_DIR=".Junk.Trained-as-spam"

# Programs
SA_LEARN_BIN=/usr/bin/sa-learn
SPAMC_BIN=/usr/bin/spamc
MAILDROP_BIN=/usr/bin/maildrop
RM_BIN=/bin/rm
MV_BIN=/bin/mv

# Other variables
TMP_DIR="/tmp/$$"
MAILDROP_FILTER_HAM="$TMP_DIR/maildrop.ham"
MAILDROP_FILTER_SPAM="$TMP_DIR/maildrop.spam"
LOCK_FILE="$HOME/$(basename "$0").$LOGNAME.pid"

# ------------------------------------------------------------
# Check if this program is already running for the current user
if test -f "$LOCK_FILE"; then
  exit 1
fi

# ------------------------------------------------------------
# Sanity checks
#for BIN in "$SA_LEARN_BIN" "$SPAMC_BIN" "$MAILDROP_BIN" "$RM_BIN" "$MV_BIN"
for BIN in "$SA_LEARN_BIN" "$SPAMC_BIN" "$RM_BIN" "$MV_BIN"
do
  which "$BIN" >/dev/null 2>&1
  if test $? -ne 0; then
    echo "$BIN could not be found"
    exit 2
  fi
done

# ------------------------------------------------------------
# Setup temporary directory and files within
if test -d "$TMP_DIR"; then
  echo "Temporary directory $TMP_DIR already exists"
  exit 3
fi
mkdir -p "$TMP_DIR"
if test $? -ne 0; then
  echo "Could not create temporary directory $TMP_DIR"
  exit 3
fi
echo "to \"\$HOME/$MAIL_DIR/$TRAINED_AS_HAM_DIR\"" >"$MAILDROP_FILTER_HAM"
echo "to \"\$HOME/$MAIL_DIR/$TRAINED_AS_SPAM_DIR\"" >"$MAILDROP_FILTER_SPAM"

# ------------------------------------------------------------
# Create lock file. From now on, do not return without removing the file
echo $$ >"$LOCK_FILE"

# ------------------------------------------------------------
# Process all messages
for MESSAGE_TYPE in ham spam
do
  if test "$MESSAGE_TYPE" = "ham"; then
    SRC_BASE_DIR="$HOME/$MAIL_DIR/$TRAINING_HAM_DIR"
    DST_BASE_DIR="$HOME/$MAIL_DIR/$TRAINED_AS_HAM_DIR"
    MAILDROP_FILTER="$MAILDROP_FILTER_HAM"
  elif test "$MESSAGE_TYPE" = "spam"; then
    SRC_BASE_DIR="$HOME/$MAIL_DIR/$TRAINING_SPAM_DIR"
    DST_BASE_DIR="$HOME/$MAIL_DIR/$TRAINED_AS_SPAM_DIR"
    MAILDROP_FILTER="$MAILDROP_FILTER_SPAM"
  else
    continue
  fi

  # Learn messages, then move them to different folder
  for SUB_DIR in new cur
  do
    SRC_DIR="$SRC_BASE_DIR/$SUB_DIR"
    DST_DIR="$DST_BASE_DIR/$SUB_DIR"
    if test ! -d "$SRC_DIR" -o ! -d "$DST_DIR"; then
      continue
    fi

    # Learn/re-learn messages
    $SA_LEARN_BIN "--$MESSAGE_TYPE" "$SRC_DIR" 2>&1 | logger

    # 1) Let spamc re-classify message - the message has been learned as the correct
    #    type, so the re-classification should give the correct result; the purpose
    #    of re-classification is to add the correct mail headers to the message, also
    #    removing any wrong headers from a previous classification
    # 2) Use maildrop to deliver the cleaned-up message to the final mailbox folder
    # 3) Remove the original message
#    find "$SUB_DIR" -type f -exec bash -c "$SPAMC_BIN <{} | "$MAILDROP_BIN" $MAILDROP_FILTER; $RM_BIN -f {}" \;
#    find "$SRC_DIR" -type f -exec bash -c "$MV_BIN {} $DST_DIR" \;
  done
done

# ------------------------------------------------------------
# Cleanup
rm -rf "$TMP_DIR"
rm -f "$LOCK_FILE"


cron, approach 2

The following command line fetches all email in a given mailbox directory and stores it in a file. The messages are modified with an additional "Received:" header.

/usr/bin/fetchmail -a -v -v -n -p IMAP --folder 'INBOX.aaa' -u patrick -m 'bash -c "/usr/bin/tee >/tmp/aaa.test"' localhost

The following command line delivers a mail message in a given file to a given directory. The command must be run as the user to whose mailbox delivery should take place.

cat /tmp/aaa.test | maildrop /tmp/maildroprc.aaa

The maildrop filter file looks like this. It must belong to the user for whom delivery takes place:

to "$HOME/Maildir/.aaa"