SpamAssassin

From HerzbubeWiki

This page is about the email spam filtering software SpamAssassin.


Debian packages

spamassassin
spamc

spamd is the daemon form of SpamAssassin. It listens on 127.0.0.1, port 783.

spamc is the client for spamd.


References

Daemon overview:

/usr/share/doc/spamassassin/README.spamd.gz

Introduction to the actual SpamAssassin program:

man spamassassin


Security

As long as no user-specific configuration is required, the daemon does not have to run as root. It can be started using the option -u <username>, e.g. with user spamassassin that was created manually.

As soon as a user-specific configuration is required (e.g. because users want to train their individual Bayes database), the daemon must run as root. Reason: the daemon needs to change uid so that it can read/write the user-specific files.

If user-specific configuration is maintained via MySQL or LDAP, the daemon does not need to run as root.


Configuration

spamd

The file /etc/default/spamassassin contains some essential configuration options for the spamd daemon:

  • The OPTIONS string contains command line options for spamd:
    • --create-prefs = create a user's preferences file if it does not exist yet
    • --max-children 5 = the daemon runs at most 5 child processes to handle requests
    • --helper-home-dir = external programs that spamd launches (e.g. Razor) should get the HOME environment variable set to the value of this option; if the option has no value, the value of HOME is taken from the spamc process that contacted the daemon
  • CRON=1 enables SpamAssassin to automatically update its rules on a daily basis
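
As an illustration, a file that combines the options above might look like the following sketch. The file consists of shell variable assignments. The values are assumptions made for this example, not necessarily what the Debian package ships.

```shell
# /etc/default/spamassassin (sketch, values are illustrative)

# Command line options passed to spamd. To run the daemon as an
# unprivileged user, "-u <username>" could be appended here.
OPTIONS="--create-prefs --max-children 5 --helper-home-dir"

# 1 = update the SpamAssassin rules automatically once a day
CRON=1
```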


SpamAssassin

Source for information in this chapter is man spamassassin, chapter "Configuration Files".

  • SpamAssassin first loads its default configuration (i.e. the factory settings) from one of several pre-defined locations. On my Debian box the directory is
/usr/share/spamassassin
  • Next it uses site-specific configuration data to override previously set values. Again, a number of pre-defined locations are tried, but on my machine the configuration will be taken from
/etc/spamassassin
  • Finally, individual user preferences are loaded that will again override any previously set values. The preferences are taken from the location specified on the command line, or from the following file if nothing is specified on the command line
~/.spamassassin/user_prefs


Default configuration

The base configuration is located in

/usr/share/spamassassin

The directory contains a number of files that are loaded in a pre-defined order (see man page).


Site-specific configuration

Site-specific configuration is taken from

/etc/spamassassin

Notes:

  • First all files ending in .pre are read in lexical order.
  • Next all files ending in .cf are read, again in lexical order
  • The convention seems to be to load plugins in .pre files, and to set configuration options in .cf files
  • Local changes are intended to go into the files local.pre and local.cf.
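
To get a feeling for the resulting load order, the following sketch prints the files of a configuration directory in the sequence described above. The function name list_site_config is made up for this example, and the byte-wise sort is an assumption about what "lexical order" means in practice.

```shell
# Print the site-specific configuration files in the order described above:
# first all *.pre files, then all *.cf files, each group sorted lexically.
list_site_config() (
  dir=${1:-/etc/spamassassin}
  for ext in pre cf; do
    for file in "$dir"/*."$ext"; do
      # Skip the unexpanded pattern when no file has this extension
      if [ -e "$file" ]; then printf '%s\n' "$file"; fi
    done | LC_ALL=C sort
  done
)
```

Calling list_site_config without an argument lists the files in /etc/spamassassin.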


The following important/interesting defaults apply (for details see man spamassassin, chapter "TAGGING"):

  • Any existing headers beginning with "X-Spam-" are removed to prevent spammer mischief and also to avoid potential problems caused by prior invocations of SpamAssassin.
  • Messages with score 5.0 or higher are classified as spam
  • If an incoming message is classified as spam, instead of modifying the original message, SpamAssassin will create a new report message and attach the original message as a message/rfc822 MIME part (ensuring the original message is completely preserved and easier to recover)
  • The new report message inherits some headers (if they are present) from the original spam message (e.g. From:, To:)
  • A spam message gets this header: X-Spam-Flag: YES
    • Although this is not documented, I had one genuine ham message that received the header X-Spam-Flag: No
  • All messages (regardless of whether they are classified as spam or ham), get these headers:
    • X-Spam-Checker-Version:
    • X-Spam-Level:
    • X-Spam-Status:


On my machine, the file local.pre does not exist. The file local.cf does exist, but is mostly a copy of /usr/share/spamassassin/local.cf. The changes I made are at the bottom:

# Use the local DNS server which is set up to query DNSBL services
# directly instead of going through the green.ch DNS servers.
dns_server 127.0.0.1

Notes:

  • In order for this to work, a local DNS server needs to be running, and it needs to be configured so as not to forward queries destined for DNSBL services to the green.ch DNS servers. See the BIND wiki page for details.
  • If you see URIBL_BLOCKED in the classification report generated by SpamAssassin, then something about the DNS server is not working as expected.


Individual user preferences

Individual user preferences are loaded from the location specified on the spamassassin, sa-learn, or spamd command line (see respective manual page for details). If the location is not specified, ~/.spamassassin/user_prefs is used if it exists. SpamAssassin will create that file if it does not already exist, using /usr/share/spamassassin/user_prefs.template as a template. The regular template file on my Debian system contains only comment lines.

The ~/.spamassassin directory contains files with the following user-specific information:

  • The auto-whitelist (aka automatic whitelist or AWL) database: a list that tracks scores for regular correspondents, i.e. the scores of all messages that someone has sent in the past are averaged, and the scores of any messages in the future are pushed towards that average.
    • Auto-whitelisting requires the AWL SpamAssassin plugin to be active, i.e. somewhere in the configuration this command must appear: loadplugin Mail::SpamAssassin::Plugin::AWL.
    • For some time SpamAssassin had the plugin enabled by default, then it was disabled. For a while I then had the plugin enabled locally (in /etc/spamassassin/local.pre), but this is no longer the case.
    • Although in theory the plugin sounds useful, in practice I didn't have any problems with it being disabled, so at the moment it stays disabled.
  • The Bayes database: is updated when messages are auto-learned as spam, or when explicit training with sa-learn occurs.


Plugins

Plugins are enabled in the site-specific configuration in /etc/spamassassin. Plugins are documented in man pages, so e.g.

man Mail::SpamAssassin::Plugin::AutoLearnThreshold

The following command lists all plugins that are currently enabled:

grep -r loadplugin /etc/spamassassin/* | sed -e 's/^[^:]*://' | grep -v ^# | sed -e 's/^loadplugin //' | sort

In my case these are

Mail::SpamAssassin::Plugin::AskDNS
Mail::SpamAssassin::Plugin::AutoLearnThreshold
Mail::SpamAssassin::Plugin::Bayes
Mail::SpamAssassin::Plugin::BodyEval
Mail::SpamAssassin::Plugin::Check
Mail::SpamAssassin::Plugin::DKIM
Mail::SpamAssassin::Plugin::DNSEval
Mail::SpamAssassin::Plugin::FreeMail
Mail::SpamAssassin::Plugin::Hashcash
Mail::SpamAssassin::Plugin::HeaderEval
Mail::SpamAssassin::Plugin::HTMLEval
Mail::SpamAssassin::Plugin::HTTPSMismatch
Mail::SpamAssassin::Plugin::ImageInfo
Mail::SpamAssassin::Plugin::MIMEEval
Mail::SpamAssassin::Plugin::MIMEHeader
Mail::SpamAssassin::Plugin::Pyzor
Mail::SpamAssassin::Plugin::Razor2
Mail::SpamAssassin::Plugin::RelayEval
Mail::SpamAssassin::Plugin::ReplaceTags
Mail::SpamAssassin::Plugin::Rule2XSBody
Mail::SpamAssassin::Plugin::SpamCop
Mail::SpamAssassin::Plugin::SPF
Mail::SpamAssassin::Plugin::URIDetail
Mail::SpamAssassin::Plugin::URIDNSBL
Mail::SpamAssassin::Plugin::URIEval
Mail::SpamAssassin::Plugin::VBounce
Mail::SpamAssassin::Plugin::WhiteListSubject
Mail::SpamAssassin::Plugin::WLBLEval


How scoring works

Classification report

The overall score of a message is the sum of the scores of all tests that match the message when it is classified by SpamAssassin. These scores can be seen in the report that SpamAssassin generates when it classifies a message. Typically this report is part of the X-Spam-Status header added to the classified message. Example:

Content analysis details:   (4.8 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
 0.1 RCVD_IN_SBL            RBL: Received via a relay in Spamhaus SBL
                            [95.214.27.225 listed in zen.spamhaus.org]
 0.2 BAYES_999              BODY: Bayes spam probability is 99.9 to 100%
                            [score: 1.0000]
 3.5 BAYES_99               BODY: Bayes spam probability is 99 to 100%
                            [score: 1.0000]
-0.0 SPF_HELO_PASS          SPF: HELO matches SPF record
-0.0 SPF_PASS               SPF: sender matches SPF record
 0.1 URI_HEX                URI: URI hostname has long hexadecimal sequence
 0.1 MIME_HTML_ONLY         BODY: Message only has text/html MIME parts
 0.0 HTML_MESSAGE           BODY: HTML included in message
 0.1 DKIM_SIGNED            Message has a DKIM or DK signature, not necessarily
                            valid
-0.1 DKIM_VALID_EF          Message has a valid DKIM or DK signature from
                            envelope-from domain
-0.1 DKIM_VALID             Message has at least one valid DKIM or DK signature
-0.1 DKIM_VALID_AU          Message has a valid DKIM or DK signature from
                            author's domain
 1.0 ACCT_PHISHING_MANY     Phishing for account information
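
That the overall score really is the sum of the individual test scores can be checked with a small helper. This is only a sketch: the function name sum_report_points is made up, and it assumes the report layout shown above (a number in the first column, the rule name in the second).

```shell
# Sum the "pts" column of a SpamAssassin report read from standard input.
# A rule line starts with a number followed by an upper-case rule name;
# continuation lines and the table header do not match and are skipped.
sum_report_points() {
  awk '$1 ~ /^-?[0-9]+\.[0-9]+$/ && $2 ~ /^[A-Z0-9_]+$/ { total += $1 }
       END { printf "%.1f\n", total }'
}
```

Feeding the example report into the function (for instance by pasting it into a file and redirecting that file to standard input) should reproduce the 4.8 points from the first line of the report.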


Test scores

The scores assigned to a test are found in the SpamAssassin configuration. On my system, the scores are assigned in the file

/usr/share/spamassassin/50_scores.cf


How the score configuration works is documented in the man page for Mail::SpamAssassin::Conf in the section "SCORING OPTIONS". In short, a score line looks like this:

score BAYES_99  0  0  3.8    3.5

Notes:

  • score is the keyword
  • BAYES_99 is the symbolic test name
  • There can be either 1 or 4 numbers. If there is only 1 number it is used as the score for the test in all scenarios. If there are 4 numbers they are used as the score for the test depending on the test scenario.
    • Number 1: Bayes and network tests are both disabled
    • Number 2: Bayes tests are disabled, network tests are enabled
    • Number 3: Bayes tests are enabled, network tests are disabled
    • Number 4: Bayes and network tests are both enabled
  • For additional details see man Mail::SpamAssassin::Conf.
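
The selection of one of the four numbers can be written down as a small function. This is a sketch to illustrate the rule above, not how SpamAssassin implements it. The function name pick_score is made up.

```shell
# Print the score that applies to a "score" line for a given scenario.
#   $1 = complete score line, e.g. "score BAYES_99  0  0  3.8    3.5"
#   $2 = 1 if Bayes tests are enabled, otherwise 0
#   $3 = 1 if network tests are enabled, otherwise 0
pick_score() {
  printf '%s\n' "$1" | awk -v bayes="$2" -v net="$3" '
    NF == 3 { print $3; exit }                      # one score for all scenarios
    NF == 6 { print $(3 + net + 2 * bayes); exit }  # scenario-specific score
    { exit 1 }                                      # unexpected number of fields
  '
}
```

For the example line, pick_score "score BAYES_99  0  0  3.8    3.5" 1 1 selects the fourth number, because both Bayes and network tests are enabled.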


Bayes tests and their scores

Bayes test scores in /usr/share/spamassassin/50_scores.cf and Bayes tests in /usr/share/spamassassin/23_bayes.cf:

# Scores in 50_scores.cf
score BAYES_00  0  0 -1.5   -1.9
score BAYES_05  0  0 -0.3   -0.5
score BAYES_20  0  0 -0.001 -0.001
score BAYES_40  0  0 -0.001 -0.001
score BAYES_50  0  0  2.0    0.8
score BAYES_60  0  0  2.5    1.5
score BAYES_80  0  0  2.7    2.0
score BAYES_95  0  0  3.2    3.0
score BAYES_99  0  0  3.8    3.5
score BAYES_999 0  0  0.2    0.2

# Tests in 23_bayes.cf

body BAYES_00           eval:check_bayes('0.00', '0.01')
body BAYES_05           eval:check_bayes('0.01', '0.05')
body BAYES_20           eval:check_bayes('0.05', '0.20')
body BAYES_40           eval:check_bayes('0.20', '0.40')

# note: tread carefully around 0.5... the Bayesian classifier
# will use that for anything it's unsure about, or if it's untrained.
body BAYES_50           eval:check_bayes('0.40', '0.60')

body BAYES_60           eval:check_bayes('0.60', '0.80')
body BAYES_80           eval:check_bayes('0.80', '0.95')
body BAYES_95           eval:check_bayes('0.95', '0.99')
body BAYES_99           eval:check_bayes('0.99', '1.00')

#Additional rule to add more of a score to BAYES_99 FOR 99.9% to 100%
body BAYES_999          eval:check_bayes('0.999', '1.00')

Notes:

  • Each Bayes test is assigned a fixed score - contrary to intuition, a message that has a high probability for being spam will not automatically receive a higher score for a given test.
  • As can be seen from the content of 23_bayes.cf, each Bayes test is assigned a non-overlapping probability range. This means that a message will be matched by only one of the tests, and that test's score is then used.
  • The only exception is the test BAYES_999, which has an overlap with BAYES_99. So a message that matches BAYES_999 will also match BAYES_99 and will therefore receive two scores.
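  • The maximum score that a message can achieve from Bayes tests, using the score values from the SpamAssassin default configuration, depends on whether network tests are enabled: 3.8 + 0.2 = 4.0 with network tests disabled, 3.5 + 0.2 = 3.7 with network tests enabled.


Conclusions:

  • In order to reach the default threshold of 5.0 and be classified as spam, a message also needs to match other tests for an additional total score of at least 1.0 (network tests disabled) or 1.3 (network tests enabled).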
  • If a false-negative message already matches the BAYES_999 test, there is probably not much point in putting it through sa-learn anymore.
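
The probability ranges can be explored with the following sketch, which prints the names of the Bayes tests that match a given probability. The function name bayes_tests_for is made up. The boundary handling (lower bound exclusive, upper bound inclusive, lowest range including 0) is an assumption of this example.

```shell
# Print the names of all BAYES_* tests whose range contains the
# probability given as $1, one name per line. The table repeats the
# ranges from 23_bayes.cf.
bayes_tests_for() {
  awk -v p="$1" '($2 == 0 || p > $2) && p <= $3 { print $1 }' <<'EOF'
BAYES_00   0.00   0.01
BAYES_05   0.01   0.05
BAYES_20   0.05   0.20
BAYES_40   0.20   0.40
BAYES_50   0.40   0.60
BAYES_60   0.60   0.80
BAYES_80   0.80   0.95
BAYES_95   0.95   0.99
BAYES_99   0.99   1.00
BAYES_999  0.999  1.00
EOF
}
```

Any probability up to 0.999 yields exactly one test name. Only probabilities above 0.999 yield two names, BAYES_99 and BAYES_999.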


Testing

The following command can be used to run a message through SpamAssassin to see whether it would be classified as ham or spam:

cat /path/to/message-file | spamc --full --username=foo

Notes:

  • The message file is a file located in a Maildir folder
  • --full causes the SpamAssassin report text to be printed instead of the rewritten message (as would normally be required by an MTA such as Exim).
  • --username is required only if you run spamc with a different user than the one whose configuration and Bayes database should be used for the classification.


To see just the score without a report, replace --full with --check:

cat /path/to/message-file | spamc --check --username=foo
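
With --check, spamc prints a single line of the form score/threshold and sets its exit code to 1 if the message is spam. In a script the exit code is the simplest thing to test. If only the printed line is available (e.g. from a log), it can be interpreted as in the following sketch. The function name classify_check_output is made up.

```shell
# Interpret the "score/threshold" line printed by spamc --check,
# e.g. "4.8/5.0". Prints "spam" if the score reaches the threshold,
# otherwise "ham".
classify_check_output() {
  printf '%s\n' "$1" |
    awk -F/ '{ print (($1 + 0 >= $2 + 0) ? "spam" : "ham") }'
}
```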


Integration with Exim

Summary

Regular spam checking is done inside an ACL.

There is also an alternative that employs a special router and transport, but this is more complicated to understand and not as efficient. If you are interested in details, check out the history of this wiki page and view an older version of the page.


Delivery into junk mail folder

Regardless of which approach is used to check messages for spam content (ACL or router/transport approach), a message that has been classified as spam must be processed by some filter mechanism so that it is delivered into a special junk mail folder instead of the regular inbox. This filter can be configured in a mail client (MUA), but in my setup the filter is part of the user's ~/.forward file.


Here's an excerpt from a .forward file that illustrates the mechanism:

# Most of the time ham mail does not contain the X-Spam-Flag header.
# However, I found one case where it was present and had the value
# "No". Spam mail, on the other hand, always has the header with
# the value "YES". The test for "Yes" is just to be on the safe side
# in case SpamAssassin devs one day decide to regularize the case of
# the header's value.
if "${if def:h_X-Spam-Flag {def}{undef}}" is "def" and
   ($h_X-Spam-Flag is "YES" or $h_X-Spam-Flag is "Yes") then

  if $h_to: is "lonelyplanet@herzbube.ch" or
     $h_to: is "paypal@herzbube.ch" or
     $h_to: is "directories.ch@herzbube.ch" or
     $h_to: is "realplayer@herzbube.ch" then
    save Maildir/.Junk.spamtrap/
  elif $h_to: contains iana.pen or
       $h_Envelope-to: contains iana.pen then
    save Maildir/.Junk.spamtrap.ianapen/
  else
    save Maildir/.Junk.Incoming/
  endif
endif


ACL approach

The ACL that is being referenced by acl_smtp_data (i.e. the ACL that is executed after the DATA command and its data have been received during the SMTP dialog) must be extended with the ACL condition spam. For instance:

warn
  spam = <username>

The condition is available only when Exim has been compiled with the so-called exiscan patch. In Debian, you have to use the package exim4-daemon-heavy to get the patch.

Details about the operations performed by the spam condition can be looked up in the Exim docs in chapter 40.2 ("Scanning with SpamAssassin").

Essentially the condition contacts the spamd daemon (entirely leaving out spamc), providing the daemon with the message to scan. When the condition "returns" the message is still in its original form, i.e. SpamAssassin did not add any headers. Instead the spam condition sets up a number of expansion variables that can be used to add the spam headers to the message inside the ACL. Also, the condition returns true if SpamAssassin has classified the message as spam.

The following expansion variables are set up:

  • $spam_score: The spam score of the message, for example “3.4” or “30.5”. This is useful for inclusion in log or reject messages.
  • $spam_score_int: The spam score of the message, multiplied by ten, as an integer value. For example “34” or “305”. This is useful for numeric comparisons in conditions. This variable is special; it is saved with the message, and written to Exim's spool file. This means that it can be used during the whole life of the message on your Exim system, in particular, in routers or transports during the later delivery phase.
  • $spam_bar: A string consisting of a number of “+” or “-” characters, representing the integer part of the spam score value. A spam score of 4.4 would have a $spam_bar value of “++++”. This is useful for inclusion in warning headers, since MUAs can match on such strings.
  • $spam_report: A multiline text table, containing the full SpamAssassin report for the message. Useful for inclusion in headers or reject messages.
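
How $spam_score_int and $spam_bar relate to $spam_score can be reproduced outside of Exim. The following functions are a sketch based only on the descriptions above. The function names are made up, and Exim's handling of edge cases (e.g. a score with integer part 0) may differ from what is shown here.

```shell
# The score multiplied by ten, as an integer, e.g. 3.4 -> 34
score_to_int() {
  awk -v s="$1" 'BEGIN { printf "%.0f\n", s * 10 }'
}

# One "+" (or "-" for negative scores) per unit of the integer part of
# the score, e.g. 4.4 -> "++++". Prints an empty line if the integer
# part is 0.
score_to_bar() {
  awk -v s="$1" 'BEGIN {
    c = (s < 0) ? "-" : "+"
    n = int(s < 0 ? -s : s)
    bar = ""
    for (i = 0; i < n; i++) bar = bar c
    print bar
  }'
}
```

Because $spam_score_int is an integer, a comparison against the default threshold of 5.0 becomes a comparison against 50.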


Usually the expansion variables will be used to fill some mail headers. An example that tries to "simulate" the headers added by a default SpamAssassin configuration might look like this:

# Perform classification
warn
  set acl_m9  = ham
  spam        = $acl_m8
  set acl_m9  = spam

# Add "X-Spam-Flag:" header only if message was spam
warn
  condition  = ${if eq {$acl_m9}{spam} {true}{false}}
  message    = X-Spam-Flag: YES

# Add additional headers regardless of message classification
warn
  message    = X-Spam-Checker-Version: ???
  message    = X-Spam-Level: $spam_score ($spam_bar)
  message    = X-Spam-Status: $spam_report

How does the $acl_m8 variable get its value?

  • Some ACL (probably the one executed during RCPT) performs recipient verification using verify = recipient
  • This results in the execution of the verification router router_localuser_verify (discussed in the "Virtual domains" section on the Exim page)
  • That verification router sets address_data = $local_part
  • After execution of the verification router has finished, the ACL transfers the value of $address_data into $acl_m8 because $address_data loses its value after the verification process ends


Training the Bayes database

Automatic training

By default SpamAssassin automatically trains the Bayes database with messages that it is currently classifying. On my system, the file /usr/share/spamassassin/10_default_prefs.cf sets the option bayes_auto_learn to 1.

Not all messages are automatically trained upon, though. The man page for Mail::SpamAssassin::Plugin::AutoLearnThreshold provides details:

  • A message is automatically trained only if
    • Its score is below a certain threshold (in this case it is trained as ham)
    • Its score is above a certain threshold (in this case it is trained as spam)
  • The score that is compared to the lower/upper thresholds does not contain certain tests
  • In order to train a message as spam, the message must score at least 3 points in the header and at least 3 points in the body

The defaults for the thresholds are 0.1 (lower) and 12 (upper). On my system the file /usr/share/spamassassin/10_default_prefs.cf contains these default values.
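
The decision rules can be summarized in a few lines. This is a sketch of the rules as described on this page, not the plugin's actual code. The function name autolearn_decision is made up, and treating the thresholds as strict limits (below 0.1, above 12) is an assumption.

```shell
# Decide whether a message would be auto-learned.
#   $1 = score used for the auto-learn decision
#   $2 = points from header tests
#   $3 = points from body tests
# Prints "ham", "spam" or "no".
autolearn_decision() {
  awk -v score="$1" -v head="$2" -v body="$3" \
      -v ham_max=0.1 -v spam_min=12 'BEGIN {
    if (score < ham_max)
      print "ham"
    else if (score > spam_min && head >= 3 && body >= 3)
      print "spam"
    else
      print "no"
  }'
}
```

A message with a high score that gets almost all of its points from body tests is therefore not learned as spam, e.g. autolearn_decision 15.3 1.0 9.0 prints "no".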


Manual training

Manual training is done by the sa-learn program. See the man page for details.

Basic invocation:

sa-learn --spam /path/to/mail/folder
sa-learn --ham /path/to/mail/folder

To train a single message:

cat messagefile | sa-learn --ham


Notes about the training process:

  • Re-training a message as ham that was previously trained as spam (and vice versa) makes SpamAssassin forget about the previous training
  • 1000 messages each for ham and spam is the minimum required for successful training; training with fewer messages works, too, but the results are not satisfying
  • Training with more than 5000 messages does not improve the quality of results much
  • Training should be done with current spam messages since spam continually changes its appearance
  • The age of ham and spam messages being trained should be about the same; training old spam but new ham causes messages with an old timestamp to be classified as spam, even if they are ham
  • The quantity of ham messages being trained should be larger than the quantity of spam messages; if the quantity is less results may not be satisfying (esp. if only a few ham messages are trained)
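
Before a larger training run it can be useful to check the folder sizes against these rules of thumb. The following sketch counts the files in the new and cur subdirectories of two Maildir folders. The function names are made up for this example.

```shell
# Count the messages in a Maildir folder, i.e. the files in its new/
# and cur/ subdirectories.
count_messages() {
  find "$1/new" "$1/cur" -type f 2>/dev/null | wc -l | tr -d ' '
}

# Compare the size of a ham folder ($1) and a spam folder ($2) with the
# rules of thumb listed above.
check_training_volume() (
  ham=$(count_messages "$1")
  spam=$(count_messages "$2")
  echo "ham: $ham, spam: $spam"
  if [ "$ham" -lt 1000 ] || [ "$spam" -lt 1000 ]; then
    echo "warning: fewer than 1000 messages of one kind"
  fi
  if [ "$ham" -le "$spam" ]; then
    echo "warning: ham should outnumber spam"
  fi
)
```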


cron

Users are given the opportunity to train their individual spam filter. A periodic cron job, run in each user's context, scans two pre-defined IMAP folders and learns ham and spam messages from them. The folders are

~/Maildir/.Junk.Training-ham
~/Maildir/.Junk.Training-spam

The user may place any messages she wants into these folders. Usually such messages will be false-positives and/or false-negatives.


This is the cron job definition:

root@pelargir:~# cat /etc/cron.d/pelargir-sa-learn
0 * * * *   patrick     /usr/local/htb/bin/htb-sa-learn.sh
0 * * * *   francesca   /usr/local/htb/bin/htb-sa-learn.sh


The script htb-sa-learn.sh is part of my "herzbube's tool box", you can find it here. A script that does the same but can be run standalone is this one:

#!/bin/bash

# ------------------------------------------------------------
# Arguments
#  None
#
# Exit codes
#  0 = ok
#  1 = this program is already running for the current user
#  2 = a prerequisite could not be found
# ------------------------------------------------------------


# ------------------------------------------------------------
# Initialize variables

# Maildirs
MAIL_DIR="Maildir"
TRAINING_HAM_DIR=".Junk.Training-ham"
TRAINING_SPAM_DIR=".Junk.Training-spam"
TRAINED_AS_HAM_DIR=".Junk.Trained-as-ham"
TRAINED_AS_SPAM_DIR=".Junk.Trained-as-spam"

# Programs
SA_LEARN_BIN=/usr/bin/sa-learn

# Other variables
LOCK_FILE="$HOME/$(basename "$0").$LOGNAME.pid"

# ------------------------------------------------------------
# Check if this program is already running for the current user
if test -f "$LOCK_FILE"; then
  exit 1
fi

# ------------------------------------------------------------
# Sanity checks
for BIN in "$SA_LEARN_BIN"
do
  if ! command -v "$BIN" >/dev/null 2>&1; then
    echo "$BIN could not be found"
    exit 2
  fi
done

# ------------------------------------------------------------
# Create lock file. The trap removes it again when the script exits
echo $$ >"$LOCK_FILE"
trap 'rm -f "$LOCK_FILE"' EXIT

# ------------------------------------------------------------
# Process all messages
for MESSAGE_TYPE in ham spam
do
  if test "$MESSAGE_TYPE" = "ham"; then
    TRAINING_BASE_DIR="$HOME/$MAIL_DIR/$TRAINING_HAM_DIR"
  elif test "$MESSAGE_TYPE" = "spam"; then
    TRAINING_BASE_DIR="$HOME/$MAIL_DIR/$TRAINING_SPAM_DIR"
  else
    continue
  fi

  # Learn messages from the new and cur subdirectories of the training folder
  for SUB_DIR in new cur
  do
    TRAINING_DIR="$TRAINING_BASE_DIR/$SUB_DIR"
    if test ! -d "$TRAINING_DIR"; then
      echo "Training directory not found: $TRAINING_DIR"
      continue
    fi

    # Learn/re-learn messages
    echo "Learning $MESSAGE_TYPE from $TRAINING_DIR for $USER" 2>&1 | logger
    $SA_LEARN_BIN "--$MESSAGE_TYPE" "$TRAINING_DIR" 2>&1 | logger
  done
done

# ------------------------------------------------------------
# Cleanup
rm -f "$LOCK_FILE"
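
The script checks for the lock file and creates it in two separate steps, so two instances started at the same moment could both pass the check. A possible variation, shown here only as a sketch with made-up function names, creates the lock file atomically by relying on the shell's noclobber option.

```shell
# Try to create the lock file $1 atomically. With noclobber (set -C) the
# redirection fails if the file already exists, so the check and the
# creation happen in a single step. Returns success if the lock was
# acquired.
acquire_lock() {
  ( set -C; echo "$$" >"$1" ) 2>/dev/null
}

# Remove the lock file $1 again
release_lock() {
  rm -f "$1"
}
```

In the script above this would replace the test -f check and the echo into the lock file: exit with code 1 if acquire_lock "$LOCK_FILE" fails, and call release_lock "$LOCK_FILE" during cleanup.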


cron, approach 2

The following command line fetches all email in a given mailbox directory and stores it in a file. The messages are modified with an additional "Received:" header.

/usr/bin/fetchmail -a -v -v -n -p IMAP --folder 'INBOX.aaa' -u patrick -m 'bash -c "/usr/bin/tee >/tmp/aaa.test"' localhost

The following command line delivers a mail message in a given file to a given directory. The command must be run as the user to whose mailbox delivery should take place.

cat /tmp/aaa.test | maildrop  /tmp/maildroprc.aaa

The maildrop filter file looks like this. It must belong to the user for whom delivery takes place:

to "$HOME/Maildir/.aaa"


Managing the Bayes database

Backup

This command backs up a user's Bayes database:

sudo -u <username> sh -c "sa-learn --backup >/tmp/bayes.db"


Restore

This command restores a user's Bayes database from a backup previously made with the --backup option:

sudo -u <username> sh -c "sa-learn --restore /tmp/bayes.db"


Clear

This command clears a user's Bayes database, for instance if wrong messages have been learnt over time and a re-training is needed:

sudo -u <username> sa-learn --clear

This essentially deletes the Bayes database files in ~/.spamassassin.


Writing custom rules


Where to place custom rules?

Rules that are to be applied site-wide:

/etc/spamassassin/local.cf

Rules that are to be applied only for a specific user:

~/.spamassassin/user_prefs

Important: User-specific rules in user_prefs are ignored by spamd, unless the allow_user_rules option is added to local.cf. Add this option only if you trust your users to not attempt to break into the system via a SpamAssassin exploit.


Basic rule anatomy

A rule consists of up to 3 lines of text with the following syntax:

<what> <rule-name> <rule-definition>
score <rule-name> <score-definition>
describe <rule-name> <description>

Notes:

  • The <what> line assigns a rule definition to a rule.
    • <what> specifies what part of the email message should be examined by the rule. Example: header indicates that the rule should examine the email message headers. These are the possible values:
      • header: Rule examines the message headers by checking for matching/non-matching regex, checking for header presence, evaluating Perl code or checking Received headers against DNSBL.
      • body: Rule examines the message subject and the text of the message body by checking for matching regex or evaluating Perl code. All textual MIME parts are decoded, with HTML tags and line breaks removed.
      • rawbody: Rule examines the text of the message body by checking for matching regex or evaluating Perl code. All textual MIME parts are decoded, with HTML tags and line breaks retained.
      • full: Rule examines the undecoded message body including all MIME parts by checking for matching regex or evaluating Perl code.
      • uri: Rule examines the URIs in the message body by checking for matching regex.
      • uridnsbl: Rule examines the URIs in the message body by checking for an address in a DNS-based blacklist.
      • meta: The rule is a meta rule. See the section below on meta tests.
    • <rule-name> gives the rule a name of your choosing.
      • The convention is to write rule names in uppercase.
      • To distinguish local rules from the ones that are packaged with SpamAssassin, it is recommended to add a prefix such as "LOCAL_".
      • Names that start with double-underscore (__) indicate to SpamAssassin that the rule is a sub-rule that is part of a meta rule - see the section below on meta rules.
      • Names that start with "T_" indicate that the rule is a test rule that should run with score 0.01. This is used mainly by the SpamAssassin developers.
    • <rule-definition> contains what the rule is actually supposed to do. The possible contents of a rule definition mostly depend on what was specified for <what>. See more specific sections below.
  • The score line assigns a score definition to a rule. This can be either 1 or 4 score values. For more details see the Test scores section elsewhere on this page.
    • A sub-rule does not (cannot?) have a score at all.
    • A test rule does not need a score definition, it automatically runs with score 0.01.
    • A rule with score 0 is not evaluated.
    • A rule with no score definition, and that is also neither a sub-rule nor a test rule, has a default score 1.0.
  • The describe line assigns a description text to a rule. The description text - which can consist of multiple words - will be placed into the verbose report, if verbose reports are used. The default setting in modern SpamAssassin versions is to use verbose reports for the body.
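
To make the three lines more tangible, the following sketch writes a complete rule into a file. The rule name LOCAL_SUBJECT_TEST, its score and the function name write_example_rule are made up for this example.

```shell
# Write a complete example rule (definition, score, description) to the
# file given as $1.
write_example_rule() {
  cat >"$1" <<'EOF'
header   LOCAL_SUBJECT_TEST  Subject =~ /\btest\b/i
score    LOCAL_SUBJECT_TEST  0.5
describe LOCAL_SUBJECT_TEST  Subject contains the word "test"
EOF
}
```

The target could be any file that SpamAssassin reads, for instance a new .cf file in the site-specific configuration directory. After that the syntax should be checked with spamassassin --lint.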


Header rule definitions

Specific header ("Subject") exists:

header FOO exists: Subject

Notes:

  • Header names are case insensitive.


Specific header ("Subject") matching/not matching a regex.

header FOO Subject =~ /\btest\b/i
header FOO Subject !~ /\btest\b/i

Notes:

  • Only the header value is used for regex matching. (TODO: really?)
  • Special header names
    • ALL: Examines all headers - see example below.
    • ToCc: Examines both the To: and Cc: headers.
    • MESSAGEID: Examines all Message-Id: headers.
    • EnvelopeFrom: Examines the address supplied in the SMTP MAIL FROM command if the MTA provides this information to SpamAssassin.
  • The regex syntax is the Perl syntax.
  • In the example \b indicates that a word-break (anything that is not an alphanumeric character or underscore) must exist, so the regex string "test" must be its own word and is not matched if it is just a substring.
  • The "i" at the end performs case insensitive regex matching.
  • A header that does not exist will not match any regular expression. Add [if-unset : STRING] behind the regex to make the rule match against the string literal STRING.
  • Special treatment for the "From:" and "To:" headers: You can write "From:name" / "To:name" and "From:addr" / "To:addr" to parse specific parts of the header. The name is the thing that, if present, mail clients will show instead of the address. Example: "Nice Name" ugly@address.com.
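
The effect of \b can be tried out on the command line without involving SpamAssassin. The following sketch uses grep -w, which treats letters, digits and the underscore as word characters and therefore approximates /\btest\b/i. It is not the Perl regex engine that SpamAssassin uses. The function name contains_word_test is made up.

```shell
# Return success if the text in $1 contains "test" as a whole word,
# ignoring case.
contains_word_test() {
  printf '%s\n' "$1" | grep -qiw 'test'
}
```

"This is a Test" matches, whereas "Contest results" and "test_case" do not, because in both cases "test" is directly adjacent to another word character.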


Any header matching a regex:

header FOO ALL =~ /\btest\b/im

Notes:

  • The keyword ALL indicates that regex matching should be applied to all headers.
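  • The "m" at the end is the Perl multi-line modifier: it lets ^ and $ match at the start and end of every header line. Without it, ^ and $ match only at the start and end of the whole block of headers. In this particular example it makes no difference because the regex contains no anchors.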
  • When ALL is used, the entire header line including the header name is used for the regex matching. (TODO: really? it's implied by SpamAssassin wiki docs)


Evaluating Perl code:

header FOO eval:<perl-code>

Notes:

  • <perl-code> is a Perl code snippet.
  • One example could be to invoke a function that is already provided by SpamAssassin. For instance, Mail::SpamAssassin::EvalTests provides all sorts of useful functions - check out the rules provided by SpamAssassin itself to get some inspiration.
  • Apparently also useful to perform checks against DNS blacklists: check_rbl(), check_rbl_txt() and check_rbl_sub(). The details are beyond the scope of this page.


Body rule definitions

Preliminary note: Body tests are powerful but slow.

Regex string exists anywhere in the body:

body FOO /\btest\b/i
rawbody FOO /\btest\b/i
full FOO /\btest\b/i

Notes:

  • A body rule performs regex matching against a body as it would be likely to appear to a person reading the message in a text-based mail client.
    • All textual MIME components of the message are decoded.
    • HTML tags are removed.
    • The message is reformatted into paragraphs (text separated by multiple newlines), and newlines within paragraphs are removed.
    • The "Subject:" header is included as the first paragraph.
  • A rawbody rule performs regex matching against a body as it would be likely to appear to a person reading the message in an HTML-based mail client.
    • All textual MIME components of the message are decoded.
    • The message is split into lines based on the line breaks in the message.
    • The "Subject:" header is not included.
  • A full rule performs regex matching against the full text of a message.
    • All headers are included, along with all textual MIME components of the message body, but no decoding is performed.
    • The message is split into lines based on the line breaks in the message.
  • The regex matching is performed multiple times on the result of the above pre-processing procedure.
    • body rule: One regex match per paragraph.
    • rawbody rule: One regex match per line.
    • full rule: One regex match per header and per message line.


Instead of matching a regex, body rules can evaluate Perl code. The same applies to the rawbody and full variants shown above:

body FOO eval:<perl-code>

Notes:

  • None.


URI rule definitions

Any URIs in the message matching a regex:

uri FOO /^mailto:spammer@spam.com$/

Notes:

  • A uri rule performs regex matching against all URIs in the message.
  • SpamAssassin creates a list of http, https, ftp, mailto, javascript and file URIs and transforms bare hostnames starting with www or ftp into appropriate URIs.


Beyond the scope of this page: If the plugin Mail::SpamAssassin::Plugin::URIDNSBL is loaded it enables writing rules with the uridnsbl directive. This takes each URI in the message, extracts the name of the host in the URI, looks up its IP address in DNS, and then checks the IP address against a specified DNSBL. Examples may be found in SpamAssassin's own rules.


Meta rule definitions

Meta rule examples:

header __SUB_1  [...]
body   __SUB_2  [...]
header __SUB_3  [...]

# Boolean meta rule: Matches if boolean condition is true (sub-rules values are treated as booleans)
meta   META1    (__SUB_1 && (__SUB_2 || ! __SUB_3))
score  META1    1.3

# Arithmetic meta rule: Matches if the arithmetic condition is true (sub-rules' values are 1 or 0)
meta   META2    (((0.2 * __SUB_1) + (0.8 * __SUB_2) + (1.3 * __SUB_3)) > 2)
score  META2    1.3

Notes:

  • Meta rules are rules that are boolean or arithmetic combinations of other sub-rules.
  • A boolean meta rule matches when the boolean condition using sub-rules is true. You can use parentheses and the logical operators && (logical-and), || (logical-or) and ! (logical-not).
  • An arithmetic meta rule matches when the arithmetic condition using sub-rules is true. The value of a sub-rule in an arithmetic meta rule is the true/false (1/0) value for whether or not the rule matched.
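
Which combinations of sub-rule matches trigger the two example meta rules can be worked out with the following sketch. The functions take the results of __SUB_1, __SUB_2 and __SUB_3 as 1 (matched) or 0 (not matched) and return success if the meta rule would match. The function names are made up, and the code only mirrors the two expressions, it does not involve SpamAssassin.

```shell
# META1: __SUB_1 && (__SUB_2 || ! __SUB_3)
meta1() {
  [ "$1" -eq 1 ] && { [ "$2" -eq 1 ] || [ "$3" -ne 1 ]; }
}

# META2: (0.2 * __SUB_1) + (0.8 * __SUB_2) + (1.3 * __SUB_3) > 2
# All weights are multiplied by 10 to stay within integer arithmetic.
meta2() {
  [ $(( 2 * $1 + 8 * $2 + 13 * $3 )) -gt 20 ]
}
```

For META2 this shows that the rule matches only if both __SUB_2 and __SUB_3 match (0.8 + 1.3 = 2.1). Whether __SUB_1 matches makes no difference to the result.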


Versioning

TODO: Rules can be versioned, and rules can specify that they require specific SpamAssassin versions.


Checking rules syntax (aka "linting")

To verify that your custom rules are syntactically correct you can run

spamassassin --lint

This exits silently if everything was OK, otherwise it should give some useful diagnostics. Add the -D option to get even more diagnostics output.


Activating rules

To activate the rules - even if you just want to test them with spamc - you have to restart SpamAssassin:

systemctl restart spamassassin
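
Since a restart with broken rules is something to avoid, the lint check and the restart can be combined. The following is a sketch with a made-up function name. It relies on spamassassin --lint returning a non-zero exit code when it finds errors.

```shell
# The commands are kept in variables so that they can be adapted
LINT_CMD="spamassassin --lint"
RESTART_CMD="systemctl restart spamassassin"

# Restart the daemon only if the configuration passes the lint check
reload_rules() {
  if $LINT_CMD; then
    $RESTART_CMD
  else
    echo "lint failed, not restarting" >&2
    return 1
  fi
}
```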