Spam

From HerzbubeWiki

Overview

This page discusses my experiences with the Spam problem.


Mail server configuration

Details about my mail server configuration can be found on these pages:


Related/interesting pages


Spam statistics

I have not yet gotten around to writing a script that collects data from my mailbox so that I can automatically generate spam statistics. I have therefore decided that whenever I clean out my spam folder, I will note down details about the state of affairs in the following table. Each line contains information about the period of time that has passed since the date of the previous entry.


Date Days elapsed Spam messages received Messages/day Correctly classified by SpamAssassin False negatives (manually trained) False negatives (DNS blacklist warning) False positives Notes
04.06.2008 n/a 46000 n/a n/a n/a n/a - -
12.08.2008 70 47365 677 46354 (97.9%) 1011 (2.1%) n/a - -
30.09.2008 49 39099 798 38222 (97.8%) 877 (2.2%) n/a - -
07.12.2008 68 45088 663 44177 (98.0%) 911 (2.0%) n/a - For a couple of weeks, the daily amount of spam had decreased significantly. I guess I was seeing the direct result of web hosting provider McColo being taken off the net. Unfortunately, the rate has since returned to "normal" (see this story about the spammers' backup plan).
12.02.2009 67 45624 681 44733 (98.0%) 891 (2.0%) n/a - -
23.04.2009 70 47366 677 46254 (97.7%) 1112 (2.3%) n/a - Didn't train for the last 40 days (while on freighter travel)
10.06.2009 48 49481 1031 47965 (96.9%) 1516 (3.1%) n/a - I have trained often, even though I have been travelling, but the rate of unrecognized spam has still gone up and is, in fact, worse than during the previous period where I did no training at all. Two possible reasons for this are:
  1. There was an overall surge of spam: the message/day rate has gone up by 52%!
  2. For some time sender addresses from the herzbube.ch domain were abused, resulting in an increased amount of collateral spam ("undeliverable message" etc.) that got through the filter.

On a side note: I pruned the auto-whitelist database, which had grown to massive size over the years, but this should not have had an influence on the amount of unrecognized spam.

30.07.2009 50 58462 1169 55520 (95.0%) 2942 (5.0%) n/a - The basic "spam message per day" rate has increased again, but what is even worse: more spam than ever has slipped past the filter, and the average is now almost 60 spam messages per day in my inbox :-( Will this terror never end?
07.09.2009 39 45899 1177 44320 (96.6%) 1579 (3.4%) n/a 1 The picture remains unchanged, but today I have finally, reluctantly, implemented greylisting. It will be interesting to see how greylisting affects this whole spam affair. An interesting number on the side: Of all the spam messages I received, 3796 (8.3%) had a recipient that contained "iana.pen". This pretty much says everything about address harvesting... Another side note: Today I added the "False positives" column because in the previous period I had one of these. In earlier periods the column says "-" because I have no reliable numbers. However, if I recall correctly, I have had only 2-3 false positives in all the time that I have been using SpamAssassin (between 5 and 6 years).
30.10.2009 53 5223 99 4872 (93.3%) 351 (6.7%) n/a - After almost 2 months I conclude that greylisting is the most effective anti-spam measure that I have ever seen: Implementing it reduced the message/day rate by an impressive 92%. Of all the spam that still came through, 4568 (87.5%) messages were delivered via the backup MX (virusscan.solnet.ch) which unfortunately does not implement greylisting. I have now temporarily removed the backup MX entry from my DNS configuration (and reset the greylist daemon's whitelist) - it will be very interesting to see the results of this latest experiment.

Update on the iana.pen statistics: 358 (6.9%) messages had the string "iana.pen" in the To: header.

18.12.2009 49 992 20 699 (70.5%) 293 (29.5%) n/a - After another 7 weeks of running entirely on a diet of greylisting (i.e. the backup MX was turned off all the time), the numbers look even better: The message/day rate went down by another hefty 80%. Compared with the rate of the pre-greylisting era, the improvement is now over 98%! An interesting observation is that the effectiveness of greylisting has lowered SpamAssassin's recognition percentage. It appears that spammers who are capable of circumventing greylisting are also better at crafting "quality" spam that can fool SpamAssassin. My new goal therefore is to raise SA's recognition rate to >=95%.

iana.pen statistics: 68 (6.9%) messages had the string "iana.pen" in the To: header.
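
The percentages quoted in this entry can be re-derived from the rounded Messages/day column. The following is only a small illustrative sketch in Python; the helper name percent_reduction is made up for this example and is not part of my mail setup.

```python
def percent_reduction(before, after):
    """Relative decrease from `before` to `after`, in percent."""
    return 100 * (before - after) / before


# Messages/day values taken from the table rows above.
pre_greylisting = 1177   # period ending 07.09.2009
with_backup_mx = 99      # period ending 30.10.2009
greylisting_only = 20    # period ending 18.12.2009

# Reduction compared with the previous period.
print(round(percent_reduction(with_backup_mx, greylisting_only)))
# Reduction compared with the pre-greylisting era.
print(round(percent_reduction(pre_greylisting, greylisting_only), 1))
```

With the table values this prints 80 and 98.3, which matches the figures given in the notes.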

10.03.2010 82 2441 30 2078 (85.1%) 363 (14.9%) n/a - In the almost 3 months since the last count, SA's recognition rate has significantly increased, probably due to the longer sampling period and therefore a better average. Although I figure I could improve the rate still further by tweaking SA parameters more aggressively, I do not want to risk any false positives. At present, I therefore let the matter stand as it is.

iana.pen statistics: 143 (5.9%) messages had the string "iana.pen" in the To: header.

14.12.2010 280 10133 36 9554 (94.3%) 579 (5.7%) n/a 4 With 9 months this has been the longest sampling period since I started this statistics page! I'm glad to see that SA's recognition rate has further improved without effort on my side - isn't this what computers are supposed to do: Lifting the burden of work from man's shoulders? :-)

iana.pen statistics: 1563 (15.4%) messages had the string "iana.pen" in the To: header.

05.01.2012 387 7393 19 6715 (90.8%) 678 (9.2%) n/a - Slightly more than a year has passed since the last sample. In this time the message/day spam rate has dropped to an all-time low. It is unclear whether the reason for this is a world-wide decrease in spam mails, or a decrease in the "quality" of spam mails, i.e. fewer spam mails make it past the greylisting wall. My gut feeling is that it is the latter. Although less overall spam is good, SpamAssassin's recognition rate has dropped by almost 4 percentage points. This makes for 1.75 spam mails per day in my inbox, which is still less than the 2.06/day average of the last sampling period (due to the low overall spam rate).

iana.pen statistics: 319 (4.3%) messages had the string "iana.pen" in the To: header.

07.06.2013 519 7308 14 6576 (90.0%) 732 (10.0%) n/a - 17 months after the last sample, I'm pleased to see that the message/day spam rate has dropped again. The average number of spam mails that have made it into my inbox is now at 1.4 messages/day. The SpamAssassin recognition rate is still about the same, but I guess it's hard to have a better rate without resorting to blacklists.

iana.pen statistics: 481 (6.6%) messages had the string "iana.pen" in the To: header.

16.01.2015 588 6252 11 5301 (84.8%) 951 (15.2%) n/a - Another long sampling period (19 months) and the message/day spam rate has dropped for the third time in a row. So far so good, but unfortunately at the same time SpamAssassin's recognition rate has dropped for the third time in a row. This time the drop was so marked that, although the total number of spam messages is lower, the average number of spam mails that have made it into my inbox has increased to 1.6 messages/day - the first increase since I switched to greylisting. This turnaround is important - and a little scary - because for the first time in almost 5 years spammers have actually become better at getting their junk into my inbox. I sincerely hope this trend will not continue.

iana.pen statistics: 509 (8.1%) messages had the string "iana.pen" in the To: header.

21.05.2016 491 5859 12 n/a n/a n/a - Roughly 16 months have passed since the last sample. For the first time since December 2010 the message/day spam rate has slightly increased. Unfortunately I can't say anything about the quality of spam because in the catastrophic server failure that happened on May 21 I have lost all information in this regard.

iana.pen statistics: 433 (7.4%) messages had the string "iana.pen" in the To: header.

26.07.2016 66 306 5 241 (78.8%) 65 (21.2%) n/a - This date marks resumption of regular mail service on pelargir.herzbube.ch, from now on with active DNS blacklist checks thrown into the mix of anti-spam measures - it will be interesting to see what effects this has on my spam statistics.

66 days have elapsed since the catastrophic server outage on May 21 this year. The statistics of this period certainly cannot be used for comparisons, because those 66 days are a mixture of

  1. No emails received at all for a few days after the server outage
  2. Emails received by a mail hoster (switchplus.ch), an emergency replacement of my own email service; and
  3. Emails received by the re-established but still experimentally configured email service on pelargir.herzbube.ch.

iana.pen statistics: 39 (12.7%) messages had the string "iana.pen" in the To: header.

25.05.2017 303 7309 24 6604 (90.4%) 510 (7.0%) 195 (2.7%) - 10 months since the last sample. The overall spam rate has increased significantly: there hasn't been a higher spam message/day rate since 2012. One possible explanation for this is that I have somewhat relaxed my greylisting configuration: I am now running with a greylisting delay of only 1 minute instead of the previous 10 minutes, and I have also whitelisted amazonses.com. The good news is that the rate of spam that made it into my inbox has not substantially increased: I had to manually train 510 messages, which is 1.7 messages/day.

iana.pen statistics: 605 (8.3%) messages had the string "iana.pen" in the To: header.

And now for the real news: How did DNS blacklists influence my spam "experience"? First, a short summary of my DNS blacklist policy: If a sending host is on at least two of the four DNS blacklists that I am using, I am outright rejecting any traffic from that host. If a sending host is on only one of the blacklists, I accept its traffic, apply the usual heuristics to the messages it sends, then place any messages that are still classified as ham into a special "DNS blacklist warning" folder. The statistics here look like this:

  • 195 messages were actually spam, i.e. false negatives. 150 were blacklisted by Barracuda, 20 by SpamCop and 25 by Spamhaus. I'm using this little command to count: grep X-DNSbl-Warning * | sed -e 's/.*blacklisted by //' | sort | uniq -c
  • 14 messages were ham, i.e. correctly classified. 1 message was from the Mantis bugtracker (mantisbt.org, blacklisted by SpamCop), two were Git commit messages from the RCMCardDAV mailing list (blacklisted by Barracuda), another one was a library newsletter (blacklisted by SpamCop), and the remaining 10 were notifications from Facebook (all blacklisted by SpamCop).
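
For reference, the same count can be done without the shell pipeline. This is a rough Python sketch, not what I actually run; it assumes, like the sed expression above, that each X-DNSbl-Warning header line ends with "blacklisted by <name>". The function name and the sample header lines are invented for illustration.

```python
import re
from collections import Counter


def count_blacklists(lines):
    """Count blacklist names in X-DNSbl-Warning header lines.

    Mirrors: grep X-DNSbl-Warning | sed 's/.*blacklisted by //' | sort | uniq -c
    """
    counts = Counter()
    for line in lines:
        if "X-DNSbl-Warning" not in line:
            continue
        # Drop everything up to and including the last "blacklisted by ".
        name = re.sub(r".*blacklisted by ", "", line.strip(), count=1)
        counts[name] += 1
    return counts


# Hypothetical header lines; the real header text may look different.
sample = [
    "X-DNSbl-Warning: 192.0.2.10 blacklisted by Barracuda",
    "Subject: hello",
    "X-DNSbl-Warning: 192.0.2.11 blacklisted by SpamCop",
    "X-DNSbl-Warning: 192.0.2.12 blacklisted by Barracuda",
]
for name, n in sorted(count_blacklists(sample).items()):
    print(n, name)
```

To run it over real messages, the lines of all files in the folder would have to be fed into count_blacklists, for example by opening each file in turn.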

So to calculate the overall false negative rate we have to add those 195 messages to the 510 that I had to manually train. The overall rate therefore is 9.7%, which is not too bad in itself. Moreover, those 195 messages didn't make it into my inbox, which is even better. I am a little concerned about the 14 ham messages which also didn't end up in my inbox, but none of them was really important, so my concern is not too great at the moment.
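
The blacklist policy described in this entry boils down to a small decision rule. The sketch below restates it in Python purely as an illustration; the real logic lives in my mail server configuration, and the function name and the returned labels are made up here.

```python
def dnsbl_action(blacklist_hits, classified_as_spam):
    """Decide what happens to a message.

    blacklist_hits: how many of the four DNS blacklists list the sending host.
    classified_as_spam: verdict of the usual heuristics (SpamAssassin).
    """
    if blacklist_hits >= 2:
        return "reject"          # traffic from the host is refused outright
    if classified_as_spam:
        return "spam"            # caught by the normal filter
    if blacklist_hits == 1:
        return "dnsbl-warning"   # ham by the heuristics, but host is on one list
    return "inbox"


# A host on one blacklist whose message passes the heuristics:
print(dnsbl_action(1, False))
```

The example call prints dnsbl-warning, i.e. the message ends up in the "DNS blacklist warning" folder instead of the inbox.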

04.08.2018 436 9493 22 8932 (94.1%) 473 (5.0%) 88 (0.9%) - 14 months since the last sample. The overall spam rate has slightly decreased despite my further relaxing the greylisting configuration - I'm still at 1 minute delay, but I had to whitelist more hosts, among them google.com. SpamAssassin has also been able to hold its own with an overall of only 473 false negatives that made it into my inbox, or 1.1 messages/day.

iana.pen statistics: 750 (7.9%) messages had the string "iana.pen" in the To: header.

DNS blacklists statistics:

  • 88 messages were actually spam, i.e. false negatives. 57 were blacklisted by Barracuda, 18 by SpamCop and 13 by Spamhaus. See the previous entry for the command I'm using to count.
  • 31 messages were ham, i.e. correctly classified. 2 messages were from AirBnb (o7.email.airbnb.com [167.89.32.249] blacklisted by SpamCop), 1 was from the insurance company AXA (o1.hv30nn.shared.sendgrid.net [50.31.63.33] blacklisted by SpamCop), 1 was from Atlassian's Jira bugtracker for the SourceTree product (mail186-12.suw21.mandrillapp.com [198.2.186.12] blacklisted by SpamCop), another one was a library newsletter (mail229.atl101.mcdlv.net [198.2.130.229] blacklisted by SpamCop), and the remaining 26 were notifications from Facebook (e.g. 69-171-232-141.outmail.facebook.com [69.171.232.141] and other sending hosts, differing only in their IP address, all blacklisted by SpamCop).

The number of DNS blacklist warnings which were actually ham messages has increased from 14 to 31, but again I am only moderately concerned because none of the messages were important. Interesting side note: The last Facebook notification message came in on 22 August 2017, so I assume that I changed some of my Facebook notification settings at around that time. Because of this I expect that when I update these statistics the next time there will be significantly fewer DNS blacklist warnings.

13.11.2019 466 10377 22 9422 (90.8%) 868 (8.4%) 87 (0.8%) - 15 months since the last sample. The overall spam rate has remained stable at 22 messages/day. In total I now have 7 entries in the whitelist of my greylisting configuration. SpamAssassin has slightly lost ground with an overall of 868 false negatives that made it into my inbox, or 1.9 messages/day.

iana.pen statistics: 571 (5.5%) messages had the string "iana.pen" in the To: header.

DNS blacklists statistics:

  • In total, 98 messages were classified by one DNS blacklist service as spam but SpamAssassin classified them as ham.
  • 87 messages were actually spam, i.e. SpamAssassin false negatives. 64 were blacklisted by Barracuda, 10 by SpamCop and 13 by Spamhaus. See the entry from 25.05.2017 for the command I'm using to count.
  • 11 messages were ham, i.e. correctly classified by SpamAssassin. 3 messages were from AirBnb (o21.email.airbnb.com [167.89.102.173], o14.email.airbnb.com [50.31.32.8] and o7.email.airbnb.com [167.89.32.249], all blacklisted by SpamCop), 1 was the "10 Years of Stack Overflow" anniversary note from StackOverflow (o16824532x199.outbound-mail.sendgrid.net [168.245.32.199] blacklisted by SpamCop), 1 was an activity notification from GitHub for one of the projects I was watching (o4.sgmail.github.com [192.254.112.99] blacklisted by SpamCop), 2 were newsletters from Project R / Republik (mail99.us4.mcsv.net [205.201.128.99] blacklisted by SpamCop), 1 was a "trial ending soon" reminder from ExpressVPN (o1.outbound-email.expressvpn.com [167.89.103.163] blacklisted by SpamCop), 1 was a survey reminder from wemakeit (mail19.atl11.rsgsv.net [205.201.133.19] blacklisted by SpamCop), and the last 2 were info mails regarding the "Evolution of Yahoo Groups" from Yahoo (mta038aa.pmx1.epsl1.com [159.127.162.38] and mta027aa.pmx1.epsl1.com [159.127.162.27], both blacklisted by SpamCop).

The number of DNS blacklist warnings which were actually ham messages has decreased from 31 to 11. The decrease is due to Facebook no longer sending me any emails, first because of the settings changes mentioned in the previous entry, and second because I have virtually stopped using Facebook. As in the last entry, none of the 11 ham messages were really important, although I would have liked to receive the Project R / Republik newsletters. I cannot help but note that SpamCop is the sole source of all those false DNS blacklistings - in the future I might consider dropping SpamCop if I find it's too aggressive.

31.03.2021 504 10684 21 9050 (84.7%) 1491 (14.0%) 143 (1.3%) - 17 months since the last sample. The overall spam rate has slightly decreased from 22 to 21 messages/day. In total I now have 9 entries in the whitelist of my greylisting configuration. SpamAssassin has significantly lost ground with an overall of 1491 false negatives that made it into my inbox, or almost 3 messages/day (up from 1.9 messages/day in the previous sample). It appears that the "quality" of spam has improved.

Spamtrap statistics:

  • 748 (7.0%) messages had the string "iana.pen" in the To: header.
  • 1 (one) message was sent to another of my spamtrap addresses.

DNS blacklists statistics:

  • In total, 164 messages were classified by one DNS blacklist service as spam but SpamAssassin classified them as ham.
  • 143 messages were actually spam, i.e. SpamAssassin false negatives. 111 were blacklisted by Barracuda, 22 by SpamCop, 9 by Spamhaus and 1 by SORBS. See the entry from 25.05.2017 for the command I'm using to count.
  • 21 messages were ham, i.e. correctly classified by SpamAssassin. 11 messages were newsletters from Project R / Republik (mail101.sea31.mcsv.net [148.105.11.101], mail169.sea71.mcsv.net [148.105.11.169], mail75.us4.mcsv.net [205.201.128.75], mail57.atl11.rsgsv.net [205.201.133.57] and mail61.sea91.rsgsv.net [148.105.15.61], all blacklisted by SpamCop), 4 were final notifications from Yahoo regarding their shutdown (mta026aa.pmx1.epsl1.com [159.127.162.26], mta028aa.pmx1.epsl1.com [159.127.162.28] and mta054aa.pmx1.epsl1.com [159.127.162.54], all blacklisted by SpamCop), 4 were friendship requests from Teleboy (suitepmta022003.emsmtp.us [185.90.22.3], suitepmta022116.emsmtp.us [185.90.22.116], all blacklisted by Barracuda), and the last 2 were continuous integration build notifications from Travis CI (mail179-7.suw41.mandrillapp.com [198.2.179.7] and mail133-10.atl131.mandrillapp.com [198.2.133.10], both blacklisted by SpamCop).

The number of DNS blacklist warnings which were actually ham messages has almost doubled from 11 to 21. To some extent the increase can be explained by a corresponding overall increase of messages blacklisted by DNS blacklist services (164, up from 98 in the last period). In any case, it is again SpamCop that is responsible for most of the false blacklistings. As announced in the previous entry, I'm now disabling SpamCop to see whether this makes a difference.

17.06.2022 443 11327 26 10510 (92.8%) 740 (6.5%) 77 (0.7%) - Almost 15 months since the last sample. The overall spam rate has increased again from 21 to 26 messages/day. The number of entries in the whitelist of my greylisting configuration did not change and is still 9. After the drop in recognition rate in the previous sample SpamAssassin has now caught up again, so that only 740 false negatives have made it into my inbox, which is about 1.7 messages/day (down from almost 3 messages/day in the previous sample). The 1.1 messages/day of 2018 is still the best result, but as long as the figure stays below 2 messages/day I'm quite happy.

Spamtrap statistics: 1330 (11.7%) messages had the string "iana.pen" in the To: header. This is one of the highest rates since I started these statistics. The only regular sample with a higher rate was in 2010 (15.4%); the non-comparable 66-day sample of 2016 came in at 12.7%.

DNS blacklists statistics:

  • In total, 77 messages were classified by one DNS blacklist service as spam but SpamAssassin classified them as ham.
  • All messages were actually spam, i.e. SpamAssassin false negatives. 52 were blacklisted by Barracuda, 23 by Spamhaus and 2 by SORBS. See the entry from 25.05.2017 for the command I'm using to count.
  • None of the messages were ham, i.e. correctly classified by SpamAssassin.

So it seems that disabling SpamCop (see previous entry) was a good move, as DNS blacklisting now actually fulfills its purpose without me having to worry that false positives are sorted out. Sorry SpamCop, but you stay disabled until further notice.

EDIT: Fixed the numbers in the column "Correctly classified by SpamAssassin" - the previous number did not include the iana.pen messages.

29.09.2023 469 8536 18 7038 (82.5%) 1301 (15.2%) 197 (2.3%) - Almost 16 months since the last sample. The overall spam rate has decreased from 26 to 18 messages/day, which is the lowest rate since 2016. Although good news in theory, in practice it did not help my user experience because at the same time the SpamAssassin recognition rate has massively dropped again, to one of the lowest rates that I have ever recorded (only one sample each in 2009 and 2016 had lower rates)! With 1301 false negatives in this sampling period, I once again had to deal with almost 3 messages/day in my inbox on average. Most concerning is that in recent months the false negatives rate has clearly increased above the sample average: I am now regularly training SpamAssassin with about 5-8 false negative messages/day - and training does not seem to help at all. Eventually I will need to address this.

The number of entries in the whitelist of my greylisting configuration has increased from 9 to 10, but I don't think that this one whitelisted host is responsible for the increased number of false negatives. Nevertheless, I disabled the host since I don't need mails from it anymore.

Spamtrap statistics:

  • 2409 (28.2%) messages had the string "iana.pen" in the To: header. This is the highest rate since I started these statistics.
  • 1 message had one of my spamtrap addresses in the To: header. After searching through the other spam messages I realized that most if not all of my spamtrap addresses are being used quite a lot in various mail headers - just not in the To: header! I now believe that these addresses have been harvested from one of my websites, probably from the SpamAssassin page on this wiki where I'm foolishly publishing my .forward file.
  • Conclusion: This and older spamtrap statistics are probably worthless, so I will most likely stop collecting them from now on.

DNS blacklists statistics:

  • In total, 197 messages were classified by one DNS blacklist service as spam but SpamAssassin classified them as ham.
  • All messages were actually spam, i.e. SpamAssassin false negatives. 186 were blacklisted by Barracuda, 10 by Spamhaus and 1 by SORBS. See the entry from 25.05.2017 for the command I'm using to count.
  • None of the messages were ham, i.e. correctly classified by SpamAssassin.
  • SpamCop stays disabled for good.
06.07.2024 281 ? ? ? ? ? - Missing sample due to data loss. About half-way through this sample period I introduced new local SpamAssassin rules because the vanilla rules were no longer able to keep the spam out of my inbox. About 2-3 messages per day made it into my inbox. After the introduction of the new rules spam recognition rose drastically.


How to calculate the statistics

Note to self on how to count. One of these days I will write a script that automates the process based on these steps.

  • Move all messages from Training-spam to Trained-as-spam
  • Rename the following inboxes, then create new ones with the original name: DNSbl-Warning + Incoming + spamtrap + Trained-as-spam
  • Create inbox DNSbl-Warning-legitimate
  • Go through all messages in DNSbl-Warning and move legitimate messages to DNSbl-Warning-legitimate
  • Days elapsed: Go to some webtool (e.g. this one) and enter the dates of the previous and the new entry.
  • Spam messages received = DNSbl-Warning + DNSbl-Warning-legitimate + Incoming + spamtrap + spamtrap/ianapen + Trained-as-spam
  • Messages/day = Spam messages received / Days elapsed
  • Correctly classified by SpamAssassin = Incoming + spamtrap + spamtrap/ianapen
    • DNSbl-Warning messages are not counted here because the messages were caught by the DNSbl system, not by SpamAssassin
  • False negatives - Manually trained = Trained-as-spam
  • False negatives - DNS blacklist warning = DNSbl-Warning
  • False positives = DNSbl-Warning-legitimate
    • Training-ham is expected to be empty because I never look at Incoming and spamtrap anymore. If any such false positives occurred I would have to note them down.
    • I also don't look at DNSbl-Warning and only do the legitimate/non-legitimate separation when I am updating the statistics on this page. So far this has never been a problem, because these false positives have never been important.
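
Until that script exists, the arithmetic of the steps above can be sketched as follows. This is only an illustration under assumptions: the folder counts are typed in by hand as a dictionary, the function name spam_stats_row is made up, and the percentages are taken relative to the total number of messages, which is how the table rows appear to be computed. The counts in the usage example are invented, not real data.

```python
from datetime import date


def spam_stats_row(prev_entry, new_entry, counts):
    """Compute one table row from per-folder message counts."""
    days = (new_entry - prev_entry).days
    correct = counts["Incoming"] + counts["spamtrap"] + counts["spamtrap/ianapen"]
    fn_trained = counts["Trained-as-spam"]
    fn_dnsbl = counts["DNSbl-Warning"]
    false_pos = counts["DNSbl-Warning-legitimate"]
    total = correct + fn_trained + fn_dnsbl + false_pos

    def pct(n):
        # Share of all received messages, rounded to one decimal place.
        return round(100 * n / total, 1)

    return {
        "days_elapsed": days,
        "spam_received": total,
        "messages_per_day": round(total / days),
        "correctly_classified": (correct, pct(correct)),
        "fn_manually_trained": (fn_trained, pct(fn_trained)),
        "fn_dnsbl_warning": (fn_dnsbl, pct(fn_dnsbl)),
        "false_positives": false_pos,
    }


# Invented counts, for illustration only.
counts = {
    "Incoming": 900,
    "spamtrap": 0,
    "spamtrap/ianapen": 50,
    "Trained-as-spam": 40,
    "DNSbl-Warning": 8,
    "DNSbl-Warning-legitimate": 2,
}
row = spam_stats_row(date(2024, 1, 1), date(2024, 4, 10), counts)
for key, value in row.items():
    print(key, value)
```

The date subtraction replaces the web tool for "Days elapsed": for example, date(2023, 9, 29) - date(2022, 6, 17) gives 469 days, which matches the 29.09.2023 row of the table.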