
Debian packages


spamd is the daemon form of SpamAssassin. It listens on port 783.

spamc is the client for spamd.


Daemon overview:


Introduction to the actual SpamAssassin program:

man spamassassin


As long as no user-specific configuration is required, the daemon does not have to run as root. It can be started using the option -u <username>, e.g. with user spamassassin that was created manually.

As soon as a user-specific configuration is required (e.g. because users want to train their individual Bayes database), the daemon must run as root. Reason: the daemon needs to change uid so that it can read/write the user-specific files.

If user-specific configuration is maintained via MySQL or LDAP, the daemon does not need to run as root.



The file /etc/default/spamassassin contains some essential configuration options for the spamd daemon:

  • ENABLED=1 makes sure that spamd is launched by its init script, e.g. when the system boots.
  • The OPTIONS string contains command line options for spamd:
    • --create-prefs = create the user's preferences file
    • --max-children 5 = the daemon should spawn at most 5 child processes
    • --helper-home-dir = external programs that spamd launches (e.g. Razor) should get the HOME environment variable set to the value of this option; if the option has no value, the value of HOME is taken from the spamc process that contacted the daemon
  • CRON=1 enables SpamAssassin to automatically update its rules on a daily basis
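
For illustration, a file with the options described above might look like the following sketch. The values are the ones from the list; the exact contents on a given system are an assumption and may differ between package versions.

```shell
# /etc/default/spamassassin (sketch; actual contents may differ)

# Launch spamd from the init script
ENABLED=1

# Command line options passed to spamd
OPTIONS="--create-prefs --max-children 5 --helper-home-dir"

# Update the SpamAssassin rules daily
CRON=1
```

The file uses shell variable syntax because the init script reads it as a shell fragment.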


The source for the information in this chapter is man spamassassin, chapter "Configuration Files".

  • SpamAssassin first loads its default configuration (i.e. the factory settings) from one of several pre-defined locations. On my Debian box the directory is
  • Next it uses site-specific configuration data to override previously set values. Again, a number of pre-defined locations are tried, but on my machine the configuration will be taken from
  • Finally, individual user preferences are loaded that will again override any previously set values. The preferences are taken from the location specified on the command line, or from the following file if nothing is specified on the command line

Default configuration

The base configuration is located in


The directory contains a number of files that are loaded in a pre-defined order (see man page).

Site-specific configuration

Site-specific configuration is taken from



  • First all files ending in .pre are read in lexical order.
  • Next all files ending in .cf are read, again in lexical order
  • The convention seems to be to load plugins in .pre files, and to set configuration options in .cf files
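
The load order described above can be made visible with a small helper. This is only a sketch: list_load_order is a hypothetical function name, the directory is passed as an argument, and "lexical order" is assumed to mean plain byte order (LC_ALL=C).

```shell
# Print the config files of a directory in the order described above:
# all *.pre files first, then all *.cf files, each group sorted lexically.
# (list_load_order is a hypothetical helper, not part of SpamAssassin.)
list_load_order() {
  dir=$1
  for ext in pre cf; do
    find "$dir" -maxdepth 1 -type f -name "*.$ext" | LC_ALL=C sort
  done
}
```

Calling list_load_order with the site-specific configuration directory prints one file per line, which makes it easy to see which file can override which.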

I have made the following changes

  • local.pre
# Auto-whitelist (AWL) has been disabled by default since 3.3.0, so we enable it explicitly
loadplugin Mail::SpamAssassin::Plugin::AWL
    • None

Besides my own changes, the following important/interesting defaults apply (for details see man spamassassin, chapter "TAGGING"):

  • Any existing headers beginning with "X-Spam-" are removed to prevent spammer mischief and also to avoid potential problems caused by prior invocations of SpamAssassin.
  • Messages with score 5.0 or higher are classified as spam
  • If an incoming message is classified as spam, instead of modifying the original message, SpamAssassin will create a new report message and attach the original message as a message/rfc822 MIME part (ensuring the original message is completely preserved and easier to recover)
  • The new report message inherits some headers (if they are present) from the original spam message (e.g. From:, To:)
  • A spam message gets this header: X-Spam-Flag: YES
    • Although this is not documented, I had one genuine ham message that received the header X-Spam-Flag: No
  • All messages (regardless of whether they are classified as spam or ham), get these headers:
    • X-Spam-Checker-Version:
    • X-Spam-Level:
    • X-Spam-Status:
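
To check from the command line whether a stored message carries the spam flag, something like the following sketch can be used. is_flagged_spam is a hypothetical helper; it looks only at the header block (everything up to the first empty line) and matches the value case-insensitively, so "YES" and "Yes" are both accepted.

```shell
# Succeed if the header block of the message on stdin contains
# "X-Spam-Flag: YES" (case-insensitive). The body is ignored.
# (is_flagged_spam is a hypothetical helper name.)
is_flagged_spam() {
  # sed prints lines up to the first empty line (end of headers), then quits
  sed '/^$/q' | grep -qi '^X-Spam-Flag:[[:space:]]*YES[[:space:]]*$'
}
```

Usage: is_flagged_spam < messagefile && echo "flagged as spam"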

Individual user preferences

Individual user preferences are loaded from the location specified on the spamassassin, sa-learn, or spamd command line (see respective manual page for details). If the location is not specified, ~/.spamassassin/user_prefs is used if it exists. SpamAssassin will create that file if it does not already exist, using /usr/share/spamassassin/user_prefs.template as a template. The regular template file on my Debian system contains only comment lines.

The ~/.spamassassin directory contains files with the following user-specific information:

  • The auto-whitelist (aka automatic whitelist or AWL) database: a list that tracks scores for regular correspondents, i.e. the scores of all messages that someone has sent in the past are averaged, and the scores of any messages in the future are pushed towards that average
  • The Bayes database, which is updated when messages are auto-learned as spam, or when explicit training with sa-learn occurs

Integration with Exim


Regular spam checking is done inside an ACL.

There is also an alternative that employs a special router and transport, but this is more complicated to understand and not as efficient. For historical reasons I like to keep the sections for this type of configuration around, but I don't maintain the information anymore.

Delivery into junk mail folder

Regardless of which approach is used to check messages for spam content (ACL or router/transport approach), a message that has been detected to contain spam must be processed by some filter mechanism so that it is delivered into a special junk mail folder instead of to the regular inbox. This filter can be configured in a mail client (MUA), but in my setup the filter is part of the user's ~/.forward file.

Here's an excerpt from a .forward file that illustrates the mechanism:

# Most of the time ham mail does not contain the X-Spam-Flag header.
# However, I found one case where it was present and had the value
# "No". Spam mail, on the other hand, always has the header with
# the value "YES". The test for "Yes" is just to be on the safe side
# in case SpamAssassin devs one day decide to regularize the case of
# the header's value.
if "${if def:h_X-Spam-Flag {def}{undef}}" is "def" and
   ($h_X-Spam-Flag is "YES" or $h_X-Spam-Flag is "Yes") then

  if $h_to: is "" or
     $h_to: is "" or
     $h_to: is "" or
     $h_to: is "" then
    save Maildir/.Junk.spamtrap/
  elif $h_to: contains iana.pen or
       $h_Envelope-to: contains iana.pen then
    save Maildir/.Junk.spamtrap.ianapen/
  else
    save Maildir/.Junk.Incoming/
  endif
endif

ACL approach

The ACL that is being referenced by acl_smtp_data (i.e. the ACL that is executed after the DATA command and its data have been received during the SMTP dialog) must be extended with the ACL condition spam. For instance:

  spam = <username>

The condition is available only when Exim has been compiled with the so-called exiscan patch. In Debian, you have to use the package exim4-daemon-heavy to get the patch.

Details about the operations performed by the spam condition can be looked up in the Exim docs in chapter 40.2 ("Scanning with SpamAssassin").

Essentially the condition contacts the spamd daemon (entirely leaving out spamc), providing the daemon with the message to scan. When the condition "returns" the message is still in its original form, i.e. SpamAssassin did not add any headers. Instead the spam condition sets up a number of expansion variables that can be used to add the spam headers to the message inside the ACL. Also, the condition returns true if SpamAssassin has classified the message as spam.

The following expansion variables are set up:

  • $spam_score: The spam score of the message, for example “3.4” or “30.5”. This is useful for inclusion in log or reject messages.
  • $spam_score_int: The spam score of the message, multiplied by ten, as an integer value. For example “34” or “305”. This is useful for numeric comparisons in conditions. This variable is special; it is saved with the message, and written to Exim's spool file. This means that it can be used during the whole life of the message on your Exim system, in particular, in routers or transports during the later delivery phase.
  • $spam_bar: A string consisting of a number of “+” or “-” characters, representing the integer part of the spam score value. A spam score of 4.4 would have a $spam_bar value of “++++”. This is useful for inclusion in warning headers, since MUAs can match on such strings.
  • $spam_report: A multiline text table, containing the full SpamAssassin report for the message. Useful for inclusion in headers or reject messages.
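
The relationship between the score and the two derived variables can be reproduced outside of Exim with the following sketch. The function names are hypothetical and the code only follows the description above; it is not Exim's implementation. How Exim renders a score whose integer part is zero is not covered by the description, and the sketch prints an empty bar in that case.

```shell
# Score multiplied by ten, as an integer (e.g. 3.4 -> 34).
# Rounding guards against floating point noise such as 33.999999.
spam_score_int() {
  awk -v s="$1" 'BEGIN { printf "%d\n", s * 10 + (s < 0 ? -0.5 : 0.5) }'
}

# One "+" (or "-" for negative scores) per whole point of the score.
spam_bar() {
  awk -v s="$1" 'BEGIN {
    c = (s < 0) ? "-" : "+"
    n = int(s < 0 ? -s : s)
    bar = ""
    for (i = 0; i < n; i++) bar = bar c
    print bar
  }'
}
```

The integer form is what makes numeric comparisons practical in conditions, since a threshold of 5.0 simply becomes a comparison against 50.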

Usually the expansion variables will be used to fill some mail headers. An example that tries to "simulate" the headers added by a default SpamAssassin configuration might look like this:

# Perform classification
warn
  set acl_m9  = ham
  spam        = $acl_m8
  set acl_m9  = spam

# Add "X-Spam-Flag:" header only if message was spam
warn
  condition  = ${if eq {$acl_m9}{spam} {true}{false}}
  add_header = X-Spam-Flag: YES

# Add additional headers regardless of message classification
warn
  add_header = X-Spam-Checker-Version: ???
  add_header = X-Spam-Level: $spam_score ($spam_bar)
  add_header = X-Spam-Status: $spam_report

How does the $acl_m8 variable get its value?

  • Some ACL (probably the one executed during RCPT) performs recipient verification using verify = recipient
  • This results in the execution of the verification router router_localuser_verify (discussed in the "Virtual domains" section on the Exim page)
  • That verification router sets address_data = $local_part
  • After execution of the verification router has finished, the ACL transfers the value of $address_data into $acl_m8 because $address_data loses its value after the verification process ends

Router/transport approach


Note: I no longer use this type of configuration, instead I prefer the ACL approach. For historical reasons I like to keep the following sections around, but I don't maintain the information anymore.

The solution using a router/transport combination basically works in the following way:

  1. a message is received by the Exim MTA
  2. the Exim MTA passes the message to a special transport that invokes the pipe spamc | exim4 and passes the message to stdin of spamc
  3. spamc contacts spamd on port 783, and passes the message over the network connection
  4. spamd invokes SpamAssassin which classifies the message and adds mail headers with the classification results to the message
  5. spamd passes the modified message back to spamc over the network connection
  6. spamc prints the modified message to stdout, which is then passed to exim4 over the pipe
  7. exim4 re-submits the modified message to the Exim MTA, using a special protocol with the custom name spam-scanned
  8. the Exim MTA performs final routing/transport for the re-submitted message (re-submitted messages are distinguished from originally received ones by the spam-scanned protocol)


router_localuser_spamcheck:
  driver = accept
  transport = transport_localuser_spamcheck
  condition = "${if and { {!eq {$received_protocol}{spam-scanned}} {!def:acl_m7} } {1}{0}}"
  domains = localhost

The router becomes active only for messages whose destination is localhost. This restriction can be set only because the router works together with the "virtual domains" solution presented on the Exim page:

  • the router router_virtualdomains transforms all addresses for virtual domains into local addresses
  • normally local addresses would be processed by router_localuser, but now router_localuser_spamcheck catches these addresses first
  • the spam-check process described in the numbered steps above occurs
  • the re-submitted message then gets processed by router_localuser

In addition to the domains = localhost restriction, the following conditions apply:

  • {!eq {$received_protocol}{spam-scanned}} = to prevent an infinite loop where the same message is spam-checked again and again
  • {!def:acl_m7} = to skip messages for which an ACL has disabled spam checking by setting $acl_m7 (e.g. because the message was locally submitted)
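
Expressed as a shell function, the combined condition reads as follows. This is only an illustration of the logic, not something Exim executes; needs_spamcheck is a hypothetical name, and an empty second argument stands for an unset $acl_m7 (Exim's def: test is true only for a non-empty value).

```shell
# Succeed only if the message still has to be spam-checked:
#   $1 = value of $received_protocol
#   $2 = value of $acl_m7 (empty if no ACL disabled spam checking)
needs_spamcheck() {
  received_protocol=$1
  acl_m7=$2
  [ "$received_protocol" != "spam-scanned" ] && [ -z "$acl_m7" ]
}
```

A message that comes back from the transport carries the protocol spam-scanned, so the function fails for it and the message falls through to the next router.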


transport_localuser_spamcheck:
  driver = pipe
  command = /usr/sbin/exim4 -oMr spam-scanned -bS
  transport_filter = /usr/bin/spamc
  use_bsmtp = true
  home_directory = "/tmp"
  current_directory = "/tmp"
  return_path_add = false
  log_output = true
  return_fail_output = true
  message_prefix =
  message_suffix =

The driver is pipe because the message should be passed on to an external program via a pipe.

The message is first passed to the external program defined in transport_filter, which in the present case is the spamc client. spamc is executed under the uid/gid that was defined by the router, i.e. a local user uid/gid. This leads to spamd using the configuration in that user's home directory (esp. the user's Bayes database).

The message that spamc returns is then passed to the external program defined in command. This leads to the message being re-submitted:

  • -bS tells Exim to expect the message as "batched SMTP input"; see man exim4 for details
    • the transport option use_bsmtp tells the transport to write the message in "batch SMTP" format
  • -oMr spam-scanned tells Exim to set the $received_protocol variable to "spam-scanned" while the re-submitted message is processed
    • Exim only allows $received_protocol to be set by "privileged" users
    • therefore the exim4 process that processes the re-submitted message must run under a "privileged" uid/gid
    • according to the Exim docs the following users are privileged:
      • root
      • the user Debian-exim (or rather: the user that was defined in EXIM_USER when the package was compiled)
      • users that are listed in the main configuration option trusted_users
      • users that belong to groups that are listed in the main configuration option trusted_groups
    • in order for the presented transport to work, all local users must therefore be placed in the trusted_users list in the main configuration

return_fail_output is true so that if command returns with a non-zero exit code, the command's output is added to the bounce message.

return_path_add is set to false because the header will be added later when final delivery is made.

The message_prefix and message_suffix options are explicitly unset in order to leave the message unchanged when it is re-submitted.

Trusted users

As mentioned above in the "Transport" section, all local users must be placed in the trusted_users list in Exim's configuration. This is achieved by creating this file:


The file defines the macro MAIN_TRUSTED_USERS which is converted by Debian's config-generating script into the Exim configuration option trusted_users. This is what the file looks like:

# 00_exim4-config_pelargir_trusted_users
# Set this macro very early with our own value so that our value takes
# precedence over the default value that the default configuration
# tries to assign later on.
# Trusted users are required if the "Router/transport" approach is
# chosen for spam checking.

MAIN_TRUSTED_USERS = uucp:patrick:francesca


  • Because the file has a prefix "00", it is processed before any of the other files in /etc/exim4/conf.d/main
  • Notably, it is processed before 02_exim4-config_options, which would set MAIN_TRUSTED_USERS to "uucp" if the macro hadn't already been set by our own custom file
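
To verify that a given user actually ended up in the list, the effective value can be checked with a small helper such as the one below. in_colon_list is a hypothetical name. The list could, for instance, be taken from the output of exim4 -bP trusted_users; the exact output format of that command is an assumption here, so the sketch only deals with the colon-separated value itself.

```shell
# Succeed if user $2 appears in the colon-separated list $1.
# Whitespace around the separators is ignored.
# (in_colon_list is a hypothetical helper name.)
in_colon_list() {
  list=$(printf '%s' "$1" | tr -d '[:space:]')
  case ":$list:" in
    *":$2:"*) return 0 ;;
    *)        return 1 ;;
  esac
}
```

Usage: in_colon_list "uucp:patrick:francesca" patrick && echo "patrick is trusted"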

Training the Bayes database

Automatic training

By default SpamAssassin automatically trains the Bayes database with messages that it is currently classifying. On my system, the file /usr/share/spamassassin/ sets the option bayes_auto_learn to 1.

Not all messages are automatically trained upon, though. The man page for Mail::SpamAssassin::Plugin::AutoLearnThreshold provides details:

  • A message is automatically trained only if one of the following applies
    • Its score is below the lower threshold (in this case it is trained as ham)
    • Its score is above the upper threshold (in this case it is trained as spam)
  • The score that is compared to the lower/upper thresholds does not contain certain tests
  • In order to train a message as spam, the message must score at least 3 points in the header and at least 3 points in the body

The defaults for the thresholds are 0.1 (lower) and 12 (upper).
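
The decision rules above can be summarized in a short sketch. autolearn_decision is a hypothetical function that only mirrors the rules as described here, using the default thresholds; it is not the plugin's code. Treating a score exactly equal to the upper threshold as "above" is an assumption, and the adjustment of the score (leaving out certain tests) is not modelled.

```shell
# Print "ham", "spam" or "none" for a message, given
#   $1 = score, $2 = points scored in the header, $3 = points scored in the body
# (autolearn_decision is a hypothetical helper; thresholds are the defaults.)
autolearn_decision() {
  awk -v score="$1" -v head="$2" -v body="$3" 'BEGIN {
    lower = 0.1; upper = 12
    if (score < lower)
      print "ham"
    else if (score >= upper && head >= 3 && body >= 3)
      print "spam"
    else
      print "none"
  }'
}
```

The wide gap between the two thresholds means that most borderline messages are never learned automatically, which is why manual training remains useful.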

Manual training

Manual training is done by the sa-learn program. See the man page for details.

Basic invocation:

sa-learn --spam /path/to/mail/folder
sa-learn --ham /path/to/mail/folder

To train a single message:

cat messagefile | sa-learn --ham

Notes about the training process:

  • Re-training a message as ham that was previously trained as spam (and vice versa) makes SpamAssassin forget about the previous training
  • 1000 messages each for ham and spam is the minimum required for successful training; training fewer messages works, too, but the results are not satisfying
  • Training more than 5000 messages does not improve the quality of the results much
  • Training should be done with current spam messages since spam continually changes its appearance
  • The age of ham and spam messages being trained should be about the same; training old spam but new ham leads to messages with an old timestamp being classified as spam, even if they are ham
  • The quantity of ham messages being trained should be larger than the quantity of spam messages; if it is smaller, the results may not be satisfying (esp. if only a few ham messages are trained)
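
To see how many messages have been learned so far, the output of sa-learn --dump magic can be summarized with a sketch like the following. bayes_counts is a hypothetical helper, and the line layout it relies on (count in the third column, the label nspam or nham as the last word) is an assumption about the dump format.

```shell
# Read the output of "sa-learn --dump magic" on stdin and print the number
# of learned spam and ham messages.
# Assumed line layout: "<prob> <x> <count> <atime>  non-token data: nspam"
bayes_counts() {
  awk '
    $NF == "nspam" { spam = $3 }
    $NF == "nham"  { ham  = $3 }
    END { printf "nspam=%d nham=%d\n", spam, ham }
  '
}
```

Usage: sa-learn --dump magic | bayes_counts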


Users are given the opportunity to train their individual spam filter. A periodic cron job that runs in each user's context scans two pre-defined IMAP folders from which it learns ham and spam messages. The folders are


The user may place any messages she wants into these folders. Usually such messages will be false-positives and/or false-negatives.

This is the cron job definition:

root@pelargir:~# cat /etc/cron.d/pelargir-sa-learn 
0 * * * *   patrick     /usr/local/htb/bin/
0 * * * *   francesca   /usr/local/htb/bin/

The script is part of my "herzbube's tool box", you can find it here. A script that does the same but can be run standalone is this one:


#!/bin/bash
# ------------------------------------------------------------
# Arguments
#  None
# Exit codes
#  0 = ok
#  1 = this program is already running for the current user
#  2 = a prerequisite could not be found
# ------------------------------------------------------------

# ------------------------------------------------------------
# Initialize variables

# Maildirs
# (The folder paths are missing from this copy of the script. HAM_DIR and
# SPAM_DIR are placeholder names; set them to the two training folders.)
HAM_DIR=
SPAM_DIR=

# Programs
# (Assumption: sa-learn can be found via PATH.)
SA_LEARN_BIN=sa-learn

# Other variables
LOCK_FILE="$HOME/$(basename "$0").$"

# ------------------------------------------------------------
# Check if this program is already running for the current user
if test -f "$LOCK_FILE"; then
  exit 1
fi

# ------------------------------------------------------------
# Sanity checks
for BIN in "$SA_LEARN_BIN"
do
  which "$BIN" >/dev/null 2>&1
  if test $? -ne 0; then
    echo "$BIN could not be found"
    exit 2
  fi
done

# ------------------------------------------------------------
# Create lock file. From now on, do not return without removing the file
echo $$ >"$LOCK_FILE"

# ------------------------------------------------------------
# Process all messages
for MESSAGE_TYPE in ham spam
do
  # Select the folder that matches the message type
  if test "$MESSAGE_TYPE" = "ham"; then
    BASE_DIR="$HAM_DIR"
  elif test "$MESSAGE_TYPE" = "spam"; then
    BASE_DIR="$SPAM_DIR"
  fi

  # Learn messages from both Maildir subdirectories
  for SUB_DIR in new cur
  do
    TRAINING_DIR="$BASE_DIR/$SUB_DIR"
    if test ! -d "$TRAINING_DIR"; then
      echo "Training directory not found: $TRAINING_DIR"
      continue
    fi

    # Learn/re-learn messages
    echo "Learning $MESSAGE_TYPE from $TRAINING_DIR for $USER" 2>&1 | logger
    $SA_LEARN_BIN "--$MESSAGE_TYPE" "$TRAINING_DIR" 2>&1 | logger
  done
done

# ------------------------------------------------------------
# Cleanup
rm -f "$LOCK_FILE"
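
A script that tests for the lock file first and creates it later leaves a small window in which two instances can start at the same time, and the lock file stays behind if the script dies early. The following sketch shows one possible way to tighten this, using the shell's noclobber option and an exit trap. acquire_lock is a hypothetical helper and is not part of the script above.

```shell
# Create the lock file atomically. With noclobber (set -C) the redirection
# fails if the file already exists, so test and creation happen in one step.
# The subshell keeps noclobber from leaking into the rest of the script.
acquire_lock() {
  ( set -C; echo "$$" > "$1" ) 2>/dev/null
}

# Usage sketch:
#   acquire_lock "$LOCK_FILE" || exit 1
#   trap 'rm -f "$LOCK_FILE"' EXIT    # remove the lock on every exit path
```

With the trap in place, the explicit cleanup at the end of the script becomes a safety net rather than the only place where the lock is removed.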

cron, approach 2

The following command line fetches all email in a given mailbox directory and stores it in a file. The messages are modified with an additional "Received:" header.

/usr/bin/fetchmail -a -v -v -n -p IMAP --folder '' -u patrick -m 'bash -c "/usr/bin/tee >/tmp/aaa.test"' localhost

The following command line delivers a mail message in a given file to a given directory. The command must be run as the user to whose mailbox delivery should take place.

cat /tmp/aaa.test | maildrop  /tmp/

The maildrop filter file looks like this; it must belong to the user for whom delivery takes place:

to "$HOME/Maildir/.aaa"