Or how we went from 2 000 000 to 200 000 daily SMTP connections overnight

If you are running a huge pile of email servers, like we do, you deal with a lot of SPAM messages on a daily basis. In that case, godspeed. This article, however, is not about how you find, flag, and prevent SPAM. It is about the computational resources that servers waste on processing messages that end up in the junk folder anyway. It turns out you can reclaim that computational power and use it for better purposes. Even if you don’t deal with huge volumes of SPAM, our quest should shed some light on a few simple server optimisations that can help you use your resources much more efficiently.

Measuring SPAM in terms of server resources

At Kyup we use EXIM as our message transfer agent (MTA). It is free software distributed under the GNU General Public License, and it aims to be a general and flexible mailer with extensive facilities for checking incoming email.

To understand the amount of server power that goes into servicing email, let’s drill down into the lifetime of a single SMTP session:

  1. Server acknowledges TCP connection
  2. EXIM MTA forks child process to handle that connection
  3. Additional SSL/TLS overhead might be added depending on the session requirements
  4. MTA starts processing the message
  5. On each phase during the SMTP session additional filtering and classification is applied based on different criteria
  6. Finally, EXIM MTA decides whether to accept and deliver, defer, or deny the e-mail message.

Each step above takes up precious system resources, and each of them is performed for each and every SMTP session. At the time, the average number of SMTP connections on each of our servers was 2 000 000 per 24 hours. Yes, that’s two million daily on each server.

So, to get an idea of how many server resources our whole infrastructure was spending on SMTP processing alone, multiply each step above by 2 000 000. Then multiply that by the thousands of mail servers we operate. And the total you get is for a single 24-hour interval.

It’s known that 70-80% of the connections established to a single SMTP server eventually turn out to be associated with SPAM delivery attempts. This means the resources that go into servicing these requests are effectively wasted. Such extreme waste made us start thinking about what we could do to optimise our email servicing processes, free up those resources, and use them for better purposes.

Asking the right questions

We started working on the problem by breaking it down to several simple questions:

  • Can we stop SMTP connections known to be associated with previous SPAM attempts from happening entirely?
  • How can we do that at the first possible sign and save most of the steps associated with the connection handling overhead?
  • What system resources will we be able to save if we manage to achieve that?
  • How can we do all of the above lightning-fast without wasting more resources than we were actually going to save?

A reliable way to identify IPs associated with SPAM messages

If you pass a SPAM message through SpamAssassin, DSPAM or a similar solution, it ends up in one of two classes: low-to-medium probability of SPAM, or screamingly high probability of SPAM.

For the purposes of our system, we started to flag message/IP pairs only where messages were identified with a high probability of SPAM. We used SpamAssassin in combination with custom EXIM ACLs we had built over time, which were 99.9% accurate in identifying such high-probability-of-SPAM messages. With EXIM we also started to flag message/IP pairs where IPs were known to deliver good messages or messages with a low probability of SPAM.

What we were missing at that point was a way for EXIM to block those IPs. Since EXIM is an MTA, blocking is not its job at all. What we actually needed was a communication method between EXIM and an external system (a daemon), where EXIM reports certain information and the daemon takes care of the rest.

EXIM readsocket - a fast and easy method for communication between the MTA and the rest of the world

While looking at the EXIM documentation, we discovered the EXIM readsocket string expansion. For us it was one of the super-handy features provided by the MTA: it allowed us to report to an external processor what is actually happening inside the MTA during ACL processing. Any EXIM variable that can be used inside the ACLs can be reported to an external program. Awesome, right?

With “readsocket”, if you have your own custom EXIM ACL rules, you can tell the MTA to report certain information to an inet or Unix domain socket when all of the ACL criteria are met, and to do that right before the specific rule takes its SMTP processing action.

Suppose you are processing a message with SpamAssassin inside an ACL and you want the IP of the delivering mail server to be reported to an external program if SpamAssassin marks the message as really “bad”, right before you drop or deny the SMTP connection. With readsocket it is really simple:

This line instructs EXIM to establish a Unix domain socket connection to /var/run/SPAMprocessor.socket and send the string “SPAM|$sender_host_address” over it. EXIM waits up to 3 seconds; if the connection can't be established in that time, it simply continues. Any output received from the socket after the message has been reported is ignored.
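
For reference, the Exim documentation gives the general shape of the expansion as ${readsocket{<name>}{<request>}{<timeout>}{<eol string>}{<fail string>}}. To exercise a listener without involving EXIM at all, a small client can imitate the same report. The following is only a sketch in Python: the function name report and its parameters are made up for illustration, while the socket path, the "SPAM|<ip>" message format and the 3-second timeout are the ones described above.

```python
import socket

SOCKET_PATH = "/var/run/SPAMprocessor.socket"  # path named in the article


def report(verdict, ip, path=SOCKET_PATH, timeout=3.0):
    """Send 'VERDICT|IP' to the collector socket.

    Returns True if the message was sent and False on any socket error,
    so a missing or slow collector never interrupts the caller.
    (Hypothetical helper, not part of EXIM.)
    """
    message = f"{verdict}|{ip}".encode("ascii")
    try:
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
            client.settimeout(timeout)  # give up after `timeout` seconds
            client.connect(path)
            client.sendall(message)
        return True
    except OSError:
        # Same attitude as the ACL: if the socket is unavailable, carry on.
        return False
```

Calling report("SPAM", "192.0.2.1") while nothing is listening simply returns False, which mirrors the "continue if the connection can't be established" behaviour.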

Now we needed something to listen on that unix domain socket and act on certain conditions.

Building the stats collector

We decided to create a simple daemon that listens on a Unix domain socket. It accepts data on that socket and uses it to sample statistics about IP addresses and whether they are delivering SPAM or SPAM-free mail messages.
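
A collector of this kind can be quite small. The sketch below is not our production daemon, just an illustration in Python of the idea: it accepts one short "VERDICT|IP" report per connection and keeps per-IP counters in memory. The "HAM" label for good messages, the names record, ReportHandler and make_server, and the 256-byte read limit are all assumptions made for the example.

```python
import os
import socketserver
from collections import defaultdict

# Per-IP counters, e.g. {"192.0.2.1": {"SPAM": 3, "HAM": 1}}
stats = defaultdict(lambda: {"SPAM": 0, "HAM": 0})


def record(line):
    """Parse 'VERDICT|IP' and update the counters; return False if malformed."""
    verdict, sep, ip = line.strip().partition("|")
    if not sep or verdict not in ("SPAM", "HAM") or not ip:
        return False
    stats[ip][verdict] += 1
    return True


class ReportHandler(socketserver.BaseRequestHandler):
    def handle(self):
        # Each connection carries a single short report.
        data = self.request.recv(256)
        record(data.decode("ascii", errors="replace"))


def make_server(path):
    """Create (but do not start) a Unix domain socket server bound to `path`."""
    if os.path.exists(path):
        os.unlink(path)  # remove a stale socket file from a previous run
    return socketserver.UnixStreamServer(path, ReportHandler)
```

Running make_server("/var/run/SPAMprocessor.socket").serve_forever() would start the loop. A real daemon would also need persistence, locking if it goes multi-threaded, and some validation of the reported address.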

With a little bit of logic-and-magic in that daemon, we can sort out the IPs of mail and non-mail servers for which 90% (or any other threshold) of previous delivery attempts were SPAM, as well as those that deliver SPAM-free messages.
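
The threshold check itself can be as simple as a ratio. Here is a minimal sketch in Python, assuming per-IP counts of SPAM and SPAM-free deliveries are available. The 90% figure comes from the text above; the min_samples guard, which avoids judging an IP on one or two messages, is an assumption added for the example.

```python
def should_block(spam, ham, threshold=0.9, min_samples=10):
    """Return True if the share of SPAM deliveries reaches the threshold.

    spam, ham   -- counts of SPAM and SPAM-free delivery attempts for one IP
    threshold   -- fraction of SPAM that triggers a block (0.9 = 90%)
    min_samples -- hypothetical guard: do not decide on too little data
    """
    total = spam + ham
    if total < min_samples:
        return False
    return spam / total >= threshold
```

With these defaults, an IP with 9 SPAM and 1 good message is considered for blocking, while one with 8 SPAM and 2 good messages is not.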

So the daemon now received from EXIM all the bad IPs that were sending SPAM-only mail. What we really wanted, though, was for the entire system to save precious resources by preventing flagged IPs from establishing TCP connections to the SMTP (and perhaps other) TCP ports at all.

Blocking bad IPs with ipset - low overhead and close-to-zero maintenance costs

If you have worked in the network world long enough and managed a lot of systems, sooner or later you notice that having 10k iptables rules in your firewall is not something you really want. At least most of the time. I suppose other people had the same problem, and that's how ipset was born.

Ipset basically allows you to store huge lists of IP addresses, networks, complex IP-port relations and so on. Later you can apply certain actions to them with iptables, and do so in a single rule.

So basically you have a separation between the part where you store the IP list in a really-fast-to-lookup structure (ipset) and the part where you reference that list in a single rule (iptables with set match support).

This allows us to store huge numbers of IPs that we want to block from accessing the SMTP service on our servers and, most importantly, to do this with minimum system and network overhead. Store in ipset, block with iptables.

By close-to-zero maintenance we mean that we wanted to define how long each blocked IP should stay in the block list and have the block expire automatically, straight out of the box. Here the ipset iptree set type helped a lot too. It allowed us to create separate lists, each with its own expiration time for the objects added to it. Finally, we combined all those separate lists by adding them to a single list of type setlist, and we match against it with a single iptables rule. Voila.

A little side-note here: the “iptree” set type was removed in ipset version 6 and is now replaced by the hash:ip set type.
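
On a current ipset the same layout might look roughly like the commands generated below. This is a sketch in Python that only builds the argument lists and does not execute anything. It assumes ipset 6 or later syntax, where hash:ip with a timeout stands in for iptree and list:set stands in for setlist. The set name smtp_block, the port and the exact expiry ladder are illustrative choices, not our real configuration.

```python
import ipaddress

# Expiry ladder in seconds (1h, 2h, 4h, 8h, 24h), following the article.
DURATIONS = (3600, 7200, 14400, 28800, 86400)
LIST_NAME = "smtp_block"  # hypothetical name for the combined list


def setup_commands(durations=DURATIONS, list_name=LIST_NAME, port=25):
    """Return argv lists that create the sets and the single iptables rule."""
    cmds = [["ipset", "create", list_name, "list:set"]]
    for seconds in durations:
        member = f"{list_name}_{seconds}"
        # One hash:ip set per expiry; entries vanish after `timeout` seconds.
        cmds.append(["ipset", "create", member, "hash:ip",
                     "timeout", str(seconds)])
        cmds.append(["ipset", "add", list_name, member])
    # A single rule matches the source address against every member set.
    cmds.append(["iptables", "-I", "INPUT", "-p", "tcp", "--dport", str(port),
                 "-m", "set", "--match-set", list_name, "src", "-j", "DROP"])
    return cmds


def block_command(ip, seconds, list_name=LIST_NAME):
    """Return the argv list that adds one IPv4 address to the matching set."""
    if seconds not in DURATIONS:
        raise ValueError(f"no set with expiry {seconds}s")
    address = ipaddress.IPv4Address(ip)  # raises ValueError if malformed
    return ["ipset", "add", f"{list_name}_{seconds}", str(address), "-exist"]
```

Each returned list can be passed to subprocess.run(cmd, check=True) by a process with sufficient privileges. Building the commands separately from running them keeps the logic easy to test without touching the firewall.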

Expanding the logic of the blocker daemon

Having different ipset lists with different expiration times allowed us to do some more stuff.

If we add an IP object to a list with an expiration of 1 hour, that particular IP address will be unable to connect to the SMTP service for 1 hour. When it is removed from the list, it will be able to connect to the SMTP server again.

However, if the very same IP continues to attempt to deliver mail messages mainly flagged as SPAM, it is then added to a block list with an expiration of 2 hours. The third time, the expiration period is increased to 4 hours, the fourth time to 8 hours, and so on. The more SPAM you send, the longer the block gets. Finally, we started adding “bad” IPs to 24-hour block lists directly.
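
The escalation can be expressed as a lookup into a ladder of expirations. The sketch below, in Python, assumes the ladder 1h, 2h, 4h, 8h and then 24h as the final rung. The text only says "and so on" before the 24-hour lists, so the exact steps between 8 and 24 hours are a guess, and block_duration is a hypothetical helper name.

```python
# Expiry ladder in seconds: 1h, 2h, 4h, 8h, then 24h for repeat offenders.
LADDER = (3600, 7200, 14400, 28800, 86400)


def block_duration(previous_blocks):
    """Return the block length in seconds given how often the IP was blocked before."""
    if previous_blocks < 0:
        raise ValueError("previous_blocks must be non-negative")
    # Stay on the last rung once the ladder is exhausted.
    return LADDER[min(previous_blocks, len(LADDER) - 1)]
```

A first offence (previous_blocks equal to 0) gets one hour, and anything past the fourth block lands in the 24-hour set.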

Wrapping it up

We were thrilled by how simple the end flow looked:

  1. EXIM flags IP-message pairs as good or bad
  2. EXIM reports those to an external socket
  3. A custom daemon listens on that socket and samples statistics about bad IPs/networks
  4. If an IP exceeds a certain threshold of SPAM vs SPAM-free messages, it is considered for blocking
  5. The length of the block depends on the previous blocking history for that IP/net
  6. Blocked IP addresses are stored in ipset iptree sets with different expirations
  7. The iptree sets are merged into a single setlist set
  8. The actual blocking is done with a single iptables rule

And more importantly, the entire system had very low overhead and close to zero maintenance, and the more you refine your ACLs and SPAM criteria, the better it gets.

The Results

When the system went into production, was deployed on all mail servers, and things were running smoothly, we left it overnight until the logs rotated so that we could get a glimpse of the end results.

The next morning, to my surprise, the ~2 000 000 daily average SMTP connections each server had been receiving had dropped to ~200 000 per 24 hours. I could not believe the magnitude of the gain from such a simple system. 2 000 000 vs 200 000 daily SMTP connections per server: a ratio of 10 to 1. Now multiply those “saved” (more precisely, “skipped”) connections by the number of servers, and that's for 24 hours only.

The set lists were full of blocked IPs that were denied access to the SMTP service. No system is perfect or completely maintenance-free, but further analysis showed that most of them were properly flagged for blocking: they were SPAM SMTP servers, zombie devices with viruses, and so on, which should never deliver mail to our servers and whose SMTP connections should not be processed at all.

Since then, our SMTP servers have kept running EXIM and are still happily delivering mail back and forth. They have many more free resources than before, resources that would otherwise be wasted on SMTP handling and message processing. We managed to put that spare power to much better use.

Further reading

It is always a good idea to maintain a central list of IP addresses and networks that should never be blocked by such a system. You can still warn if bad mail from those exceeds certain limits, but it is really up to your needs. ZooKeeper might help you with this.
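
Whatever stores that list, the check before blocking boils down to a network membership test. Here is a minimal sketch in Python using the standard ipaddress module. The networks shown are documentation ranges used as placeholders, and is_protected is a hypothetical helper name.

```python
import ipaddress

# Placeholder never-block list; in practice this would come from a central store.
NEVER_BLOCK = [ipaddress.ip_network(n) for n in ("192.0.2.0/24", "2001:db8::/32")]


def is_protected(ip, networks=NEVER_BLOCK):
    """Return True if `ip` falls inside any network that must never be blocked."""
    address = ipaddress.ip_address(ip)  # raises ValueError if malformed
    # Membership is False when address and network differ in IP version.
    return any(address in network for network in networks)
```

The blocker would call is_protected(ip) first and only warn, rather than block, when it returns True.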

It is also a good idea to save more resources on the neighbouring SMTP systems if you know about certain IPs marked as bad. You can use distributed notification mechanisms that will make those neighbours block offending IPs as well. Take a look at Serf and let us know what you think.