A question that comes up quite often concerns the volume of junk email getting through to our users’ mailboxes. It’s usually phrased something like “Why am I getting so much spam?” or “I’m getting huge amounts of spam!” The answer is not one anyone wants to hear and certainly not one I like giving.
It turns out that spam filtering is hard. The people sending this junk are very creative and they are very good at making it look like legitimate email. They are also very good at identifying what gets filtered and changing their tactics so that it has a better chance of getting through. In other words, it’s an arms race between the filters and the spammers, and the spammers are winning.
One big question I get all the time is why we can’t make the filter more aggressive. After all, the IT department at the user’s workplace seems to be able to block a much larger share of the junk. To understand why a single organization’s IT department can filter more aggressively than a provider with even a handful of customers, let alone hundreds or thousands, you have to understand the idea of false positives. A false positive is simply a message that is identified as junk but is, in fact, legitimate.
Let’s consider an example. Suppose you request a password reset link from your favourite web site. The resulting message will have a link of some kind with a small amount of text around it. You would be greatly annoyed if that message was filtered. Yet these often do get filtered. It happens often enough that many sites explicitly ask you to check your spam folder or add their sending address to a whitelist. This sort of false positive is fairly innocuous and is usually discovered quickly because an expected message simply fails to arrive.
Often it is not nearly so obvious that a message failed to arrive. This is the case with unsolicited messages. These could be business queries, support requests for your product, or other things you might want to see. It is in this category of messages that we see the true cost of false positives. In the case of a business query or support request, to the sender it will look as though the message has simply gone unanswered, which is bad for business. Meanwhile, the business operator has no idea there was a problem or query and loses a customer. For general business purposes, it is far better to have to manually deal with even dozens of junk messages than to lose a single legitimate query.
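To make that trade-off concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the scores, the candidate thresholds, and the cost numbers are assumptions, not anything taken from our actual filter. It only shows that once a lost legitimate message is treated as costing about as much as a couple of dozen junk messages, the cheapest setting is the loosest one.

```python
# Hypothetical scored messages: (spam_score from 0 to 1, is_actually_spam).
# These values are invented for illustration only.
MESSAGES = [
    (0.95, True),
    (0.90, True),
    (0.80, True),
    (0.75, False),  # e.g. a terse password reset message
    (0.60, True),
    (0.55, False),  # e.g. an unsolicited business query
    (0.40, True),
    (0.20, False),
    (0.10, False),
]


def evaluate(threshold, messages=MESSAGES):
    """Return (false_positives, missed_spam) when scores >= threshold are filtered."""
    false_positives = sum(
        1 for score, is_spam in messages if score >= threshold and not is_spam
    )
    missed_spam = sum(
        1 for score, is_spam in messages if score < threshold and is_spam
    )
    return false_positives, missed_spam


def total_cost(threshold, fp_cost, miss_cost=1, messages=MESSAGES):
    """Weigh lost legitimate mail against junk that reaches the inbox."""
    false_positives, missed_spam = evaluate(threshold, messages)
    return false_positives * fp_cost + missed_spam * miss_cost


def best_threshold(fp_cost, candidates=(0.3, 0.5, 0.7, 0.85)):
    """Pick the candidate threshold with the lowest total cost."""
    return min(candidates, key=lambda t: total_cost(t, fp_cost))


if __name__ == "__main__":
    # If a false positive costs the same as one junk message, filter hard.
    print(best_threshold(fp_cost=1))
    # If it costs as much as two dozen junk messages, filter loosely.
    print(best_threshold(fp_cost=24))
```

With these made-up numbers, the aggressive threshold wins when both kinds of mistake cost the same, and the loosest threshold wins as soon as a lost query is weighted heavily. The loosest threshold is also the one that lets the most junk through.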
For this reason, much spam attempts to mimic legitimate business messages so that it tends to get past filters tuned to minimize false positives. As a result, over time, more of the junk makes it through filters and it gets increasingly difficult, often impossible, to accurately filter such messages. The same effect applies to things like renewal notices, newsletters, and the like.
Now you may be thinking that it’s obvious that something is spam when you look at it. And, yes, that is usually the case. So why, then, does it get through? There are many reasons but most boil down to one thing. Semantics. To us as readers, the words we see have meaning. We can very quickly associate the words we see with a meaning and know whether that makes sense or not. Further, even if it does make sense, we can often identify patterns of meaning that indicate a scam or something generally not wanted. All of that sits behind a visual system that is very good at pattern recognition and that is looking at the fully rendered message.
A computer, on the other hand, is not looking at the fully rendered message. It’s looking at the underlying representation. While that does give it some additional clues that are often useful, it is missing a very important aspect of the analysis. Even in the absence of tricks to obscure the text in a message, the computer has no semantic understanding of the words. It has no cognitive model of the world against which to judge whether something makes sense. It doesn’t know that one offer is “too good to be true” while another looks legitimate. Thus, even if it could read the text in a fully rendered message, it would still be at a distinct disadvantage compared to a human when identifying junk.
So, with all of that said, what does that have to do with single organization filters being easier? Simple. A single organization usually has a single policy on what is an acceptable level of false positives, and it often has a well-defined group of senders it is willing to inconvenience. By that I mean a business whose customers and business contacts are primarily located in, say, Calgary will generally not feel any pain from filtering all messages originating in Asia, for instance. In that way, it can markedly reduce the amount of junk getting through its filters by simply filtering anything that is not in English and anything that originates outside of North America. It can then further restrict what it accepts based on policies covering attachments and so on. This works fine for a single business since there are clear types of correspondence it is interested in.
As soon as you add additional organizations to the mix, that no longer works. One of our clients may do extensive business in Russia while another might not care about Russia but does business in India. And so on. As a result, we have to be very careful about large blanket blocks of remote mail servers. Likewise, we have to be careful about what level of policing we apply to other aspects of a message, like attachments, since many of our customers are simply people with vanity domains who get their personal mail there. They want to see those pictures of cute cats and so on. The unavoidable side effect of this is that our filters have to be somewhat looser to avoid inconveniencing our users with too many false positives.
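The shrinking set of safe blanket blocks can be sketched in a few lines of Python. The region names and customers below are hypothetical, and a real filter works with far finer distinctions than five regions. The point is only that a provider can block for everyone just what no customer wants, and that set empties out quickly as customers are added.

```python
# Hypothetical, coarse set of regions mail might originate from.
ALL_REGIONS = {"north_america", "europe", "russia", "india", "asia_other"}


def blockable_regions(wanted_by_customer):
    """Return the regions that no customer wants mail from.

    wanted_by_customer maps a customer name to the set of regions
    that customer expects legitimate mail from.
    """
    wanted = set().union(*wanted_by_customer.values())
    return ALL_REGIONS - wanted


# A single organization with purely local contacts can block a lot.
single_org = {"calgary_shop": {"north_america"}}

# A provider has to honour every customer's needs at once.
shared_provider = {
    "calgary_shop": {"north_america"},
    "exporter": {"north_america", "russia"},
    "outsourcer": {"north_america", "india"},
    "vanity_domain": {"north_america", "europe", "asia_other"},
}

if __name__ == "__main__":
    print(sorted(blockable_regions(single_org)))
    print(sorted(blockable_regions(shared_provider)))
```

In this sketch the single business can block four of the five regions outright, while the provider with four customers is left with nothing it can block across the board.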
That said, we are always looking for ways to improve the filtering, and the technology we use for filtering is kept up to date. Remember, we get spam too, so we really do want to improve the filter.