Technical commentary and opinion.

Reducing Spam: Greylisting and the Temporality of E-mail

March 12, 2008 - Mark

Greylisting and the Temporality of E-mail

In the past few years, as spam has increased to be over 90% of all e-mail sent on the Internet, several solutions have been created to help deal with the problem.  At Heluna, we’re proud of our filtering system, and we’re constantly looking at new technology to help our clients see even less spam.  We’ve seen quite a few different solutions, some very useful, and some…  more interesting ones.

Greylisting falls into the latter category.  The basic idea behind greylisting, as the term is used here (the scheme is more widely known as challenge-response filtering), is this: when a message arrives from a new sender, your mailserver places that message into a quarantine instead of delivering it to your inbox.  It then sends an e-mail to the sender, usually with a link to click, a specific address to reply to, or some other mechanism by which the sender confirms that they actually sent the message.  Once the sender has done so, the mailserver releases the original message from the quarantine, delivers it to you, and places the sender on an approved senders list.
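As a rough illustration of that flow, here is a minimal sketch in Python.  It is a toy model, not how any particular product works: the class name, method names, and addresses are all hypothetical, and a real system would also have to send the challenge e-mail and verify the response.

```python
from dataclasses import dataclass, field


@dataclass
class ChallengeMailbox:
    """Toy model of the quarantine-and-challenge flow (names are hypothetical)."""
    approved: set = field(default_factory=set)           # senders who answered a challenge
    quarantine: dict = field(default_factory=dict)       # sender -> list of held messages
    inbox: list = field(default_factory=list)            # (sender, message) pairs delivered
    challenges_sent: list = field(default_factory=list)  # senders we have challenged

    def receive(self, sender, message):
        if sender in self.approved:
            self.inbox.append((sender, message))
            return "delivered"
        # Unknown sender: hold the message, and challenge only once per sender.
        if sender not in self.quarantine:
            self.challenges_sent.append(sender)
        self.quarantine.setdefault(sender, []).append(message)
        return "quarantined"

    def confirm(self, sender):
        """Called when the sender answers the challenge."""
        self.approved.add(sender)
        for message in self.quarantine.pop(sender, []):
            self.inbox.append((sender, message))


box = ChallengeMailbox()
first = box.receive("alice@example.com", "lunch?")       # held, challenge sent
box.confirm("alice@example.com")                         # alice answers the challenge
second = box.receive("alice@example.com", "still on?")   # now on the approved list
spam = box.receive("spammer@example.net", "buy now")     # held, never confirmed
```

Note that the spammer's message stays in the quarantine indefinitely, which is the behavior the scheme relies on; so does mail from any legitimate sender who never answers.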

Greylisting advocates and companies often claim that their solution provides 100% accuracy.  The primary claim is that spammers (even those that use their real e-mail address rather than a forged one) will never respond to the “challenge” e-mail, and so the message will never make it to your inbox.  Additionally, greylisting companies will challenge based upon the originating e-mail server, so even if a spammer manages to forge an e-mail address that is in your approved list, the message will still get challenged and, as a result, will not go through.

This sounds like an enticing solution, one that can prove that a human sent the message, but greylisting also has a number of shortcomings that, over time, make it much more of a hassle than a benefit.

First: many companies that send out automated e-mails (Yahoo, eBay, Amazon, among others) send each message to you from a custom sender address that changes each time they send you a message.  When your greylisting solution responds to that message, these vendors will often treat the response as a bounce, and will take your e-mail address off their list.  Imagine following an eBay auction, only to not know when it was ending because the e-mail notice never made it to your inbox.

Next, the vast majority of people will simply never click on the greylisting link (or respond to the greylisting e-mail).  This leads to confusion, lost e-mails, and ultimately manual intervention to add those senders to your approved list yourself.  That can be far more of a hassle to deal with on a daily basis, since you now need to guess in advance who will be sending you messages, which is rarely possible.  The alternative is to constantly check your quarantine, which defeats the purpose of greylisting.

Last, assuming that people do complete the greylisting challenge, and the message is delivered to your inbox, the timeliness of the e-mail depends upon how quickly the sender checks their e-mail and responds to the challenge, after which the original message still has to be released to you.  This can introduce an enormous (and in our opinion, unacceptable) delay to the delivery of the original message.

We’re keeping an eye on how greylisting is evolving, but for now, the Heluna service performs extremely well without greylisting.

 

The 12 Days of Spam-mas

December 17, 2007 - Mark

Here at Heluna, we stop quite a few spam messages from getting to their intended destination.  To get into the holiday spirit, then, we present to you: the 12 days of spam-mas, a list of keywords and the number of times those keywords have shown up in the sender e-mail address over the past 30 days:

“partridge”: 27 times

“dove”: 293 times

“french”: 492 times

“calling”: 92 times

“gold”: 15156 times

“geese”: 26 times

“swans”: 178 times

“maids”: 26 times

“ladies”: 43028 times

“lords”: 32 times

“piper”: 170 times

“drummer”: 15 times

We won’t even get into the number of times that “true love” shows up in the subject line of rejected spam messages.  Happy holidays, everyone!

 

Java Scalability 101: Page Fragment Caching

November 29, 2007 - Mark

Note: this is the second in a series of articles on achieving a faster application response and scalability for small businesses, concentrating upon free or inexpensive solutions that are easy to implement.

One of the rising stars in the open source world is memcached, billed as “a high-performance, distributed memory object caching system”. Memcached has an API that can be accessed via nearly every web programming environment, and is a valuable tool for storing frequently-accessed objects, large pieces of content, and other items that need to be retrieved quickly (versus grabbing them from a database).

What many Java programmers may not be aware of is a comparable system that runs natively inside your application framework. OSCache allows for nearly all of the same functionality as memcached, and it doesn’t require a separate process to be running; everything runs within the context of your web app (or your container). It can also be accessed remotely, or clustered across servers (via native clustered shared memory in solutions such as JBoss or the newer Tomcat).

One of the better uses for OSCache, however, is its JSP Tag library. Assume for a moment that you have the following:

Hello,
<%
// extremely expensive JSP logic that winds up
// returning the word "world"
%>

If the “extremely expensive JSP logic” portion of that code were something that ran a large SQL query, fetched data from a remote service, or parsed XML, and the resulting output didn’t actually change that often, you could use OSCache to cache that individual page fragment in memory, like so:

<%@ taglib uri="http://www.opensymphony.com/oscache" prefix="cache" %>
Hello,
<cache:cache>
<%
// extremely expensive code
%>
</cache:cache>

The first time this JSP page was run, the expensive logic would be run, and the results would be displayed. However, on each subsequent access (up to a period of time) the resulting HTML would simply be pulled from an in-memory cache and immediately displayed to the user, a much faster operation (resulting in a much faster page load time).

What does this mean for real world sites? It means that even if you were to set the cache timeout to something extremely low, e.g. 5 seconds, any bottleneck caused by this expensive piece of code should go away. This helps with traffic spikes (e.g. getting mentioned on the larger social news sites or in traditional media) and can even help blunt denial of service attacks. A few well-placed page fragment caches on each page can speed up that page immensely.
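To see why even a 5-second timeout removes the bottleneck, here is a small language-neutral sketch of a time-based fragment cache, written in Python for brevity. It is not OSCache's implementation; the FragmentCache class, its method names, and the fake clock are all hypothetical and exist only to illustrate the idea.

```python
import time


class FragmentCache:
    """Minimal time-based cache for rendered page fragments (illustrative only)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock          # injectable so the example can fake time
        self._entries = {}          # key -> (expires_at, rendered_html)

    def get_or_render(self, key, render):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[0]:
            return entry[1]         # cache hit: skip the expensive work
        html = render()             # cache miss or expired: do the work once
        self._entries[key] = (now + self.ttl_seconds, html)
        return html


calls = []


def expensive_fragment():
    """Stand-in for the expensive JSP logic; records each real execution."""
    calls.append(1)
    return "world"


fake_now = [0.0]
cache = FragmentCache(ttl_seconds=5, clock=lambda: fake_now[0])

# 1000 requests arriving within the 5-second window run the expensive code once.
outputs = [cache.get_or_render("greeting", expensive_fragment) for _ in range(1000)]

# After the timeout passes, the next request refreshes the fragment.
fake_now[0] = 6.0
outputs.append(cache.get_or_render("greeting", expensive_fragment))
```

Under these assumptions, 1001 requests trigger the expensive code only twice; the shorter the timeout, the fresher the content, but the work is still bounded by the timeout rather than by the traffic.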

The OSCache site details all of its functionality, but for mid-sized JSP-based sites, adding a page-level cache can improve response times by an order of magnitude, especially under heavy load. Caching smaller fragments of a page, while not as effective as caching an entire page, is still useful: any content on your page that doesn’t change very often is much better served by being pulled directly from memory 99% of the time.

Are you using OSCache? Share your experiences in the comments below.

 

Amazon EC2 Limitations

October 16, 2007 - Mark

Amazon’s Elastic Compute Cloud utility service has the capability to drastically change how Internet services are deployed, but it has some weaknesses that developers should consider before switching their applications to use EC2 functionality.

Today, Amazon rolled out their EC2 computing-as-a-utility cloud as an “unlimited beta”, which essentially means “anyone can sign up for EC2, but if things break, don’t blame us.” EC2, and virtualization in general, will undoubtedly change the future of Internet application deployment– even the smallest website, after a small list of code changes, will have the ability to scale to huge proportions. Amazon’s “pay for what you use” model is drastically different than the traditional “rent-a-box” model of hosting providers, and ultimately will likely become the route that more hosting providers take. For example, MediaTemple’s Grid Server is a more rigid container, but in that same utility model.

However, EC2 has some limitations that make it currently cost prohibitive vs. traditional hosting models, especially for specialized applications such as Heluna.

First, let’s be clear: for performance spikes, EC2 is a great equalizer. The ability to automatically bring up instances when needed, and bring them down after you’re done with them, gives developers a unique view into what resources their applications actually need.

The Heluna antispam service, as you might imagine, mostly requires CPU cycles in order to run the large number of tests against every e-mail. Those CPU cycles follow a predictable pattern (just look at our statistics graphs), so EC2 instances could be brought up and down to match. However, we also require a base amount of CPU cycles that grows over time and with each client that signs up. Also, the Heluna service has been planned out so that we are able to deal with all but the largest e-mail spikes, without having to bring up additional computing resources on the fly.

We have a rough idea of how many cycles the average message takes to scan, so based upon our current workload we can say how many GHz of computing resources we need. The trouble with EC2 here is that the cost of CPU cycles is high (compared to the cost of memory and disk space availability, which is relatively low). While Amazon may reward the developer that gets lots of large activity spikes, they penalize the service that has a predictable amount of always-needed capacity.

Per Amazon’s documentation, “One EC2 Compute Unit provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.” With a cost of $0.10 per 1 GHz per clock hour, a base “always in use” requirement of 5 GHz would run $0.50 per hour, or $360 per month. Contrast that with a hosting provider offering a dual processor dual-core 2 GHz-per-core solution (essentially, 4×2 GHz or 8 GHz) for $339 per month.
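The comparison is easier to see as cost per GHz. The short Python sketch below just reruns the arithmetic from the figures above; the 30-day month is an assumption implied by the $360 figure, and the function and variable names are made up for illustration.

```python
HOURS_PER_MONTH = 24 * 30  # assumes the 30-day month implied by the $360 figure


def monthly_cost(ghz_needed, dollars_per_ghz_hour):
    """Monthly cost of keeping `ghz_needed` GHz running around the clock."""
    return ghz_needed * dollars_per_ghz_hour * HOURS_PER_MONTH


# EC2: $0.10 per GHz-hour, 5 GHz always in use.
ec2_monthly = monthly_cost(5, 0.10)
ec2_per_ghz = ec2_monthly / 5

# Traditional host: 2 processors x 2 cores x 2 GHz for a flat $339 per month.
dedicated_monthly = 339.0
dedicated_ghz = 2 * 2 * 2.0
dedicated_per_ghz = dedicated_monthly / dedicated_ghz

print(f"EC2:       ${ec2_monthly:.2f}/month, ${ec2_per_ghz:.2f} per GHz")
print(f"Dedicated: ${dedicated_monthly:.2f}/month, ${dedicated_per_ghz:.2f} per GHz")
```

On these numbers the dedicated box costs a little less per month while providing 8 GHz rather than 5 GHz, so its cost per always-on GHz is noticeably lower.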

The EC2 model would offer more memory, but currently Heluna is not a memory bound application. Our goal is to get messages into memory, process them, then either dump them into our database (per the user’s quarantine), discard them, or ship them off to the client’s mail server. Either way, operation of the service does not require an enormous amount of memory.

Lastly, Heluna does require a fair amount of bandwidth. Again, as you can imagine, most of our bandwidth is incoming (completely different from almost every other Internet application), but for each message we process there is a collection of network tests to perform, including DNS checks, distributed checksums, and round trips to our database, before finally routing a clean message to its rightful place. 1000GB of inbound bandwidth would cost $100. In contrast, almost every hosting provider will include a set amount of bandwidth with their monthly fee; in many cases, it can be at least 2000GB.

In summary: if your application requires a considerable amount of memory, with little to no CPU consumption, and you either do a very small amount of bandwidth or a very, very large amount of bandwidth, EC2 may be the perfect platform for you. Also, if your application is still using a very small amount of resources, the base single-EC2 unit would be very useful. However, until the cost of CPU cycles comes down at Amazon, Heluna will continue to be hosted using a more traditional hosting model.

 

Reducing Spam: MX Records

October 11, 2007 - Mark

A quick, easy way to help reduce the amount of spam attempts to your domain is the use of “bad” MX records. With the correct MX setup, well-behaved mail servers will still send e-mail to you, while a large portion of spammers will bypass your domain.

First, a quick tutorial. A DNS MX record is a “mail exchanger” record for your domain; it tells mail servers how to send e-mail to your domain. The MX record is set to either your own mail server, or (in the case of smaller domains) your ISP’s mail server. There can be more than one MX record per domain, which is why each MX record contains a number indicating its priority; the record with the lowest number is tried first. For example, the MX records for cnn.com are:

cnn.com mail exchanger = 30 lonmail1.turner.com.
cnn.com mail exchanger = 40 hkgmail1.turner.com.
cnn.com mail exchanger = 10 atlmail3.turner.com.
cnn.com mail exchanger = 10 atlmail5.turner.com.
cnn.com mail exchanger = 10 nycmail1.turner.com.

As you can see by the records, any mail server trying to send e-mail to someone@cnn.com needs to connect to the machines “atlmail3.turner.com” or “atlmail5.turner.com” or “nycmail1.turner.com”. If those machines are offline, e-mail should go to “lonmail1.turner.com” and finally, if that machine is down, e-mail should go to “hkgmail1.turner.com”.

Now, many spammers choose to ignore this rule and will instead only attempt to go to one of the MX records and then give up. In our case, if a spammer tried to connect to “atlmail3.turner.com” and it was offline, the spammer would move onto the next spam target and CNN wouldn’t get that piece of spam.

Armed with the knowledge that well-behaved mail servers will always try the next MX record, and many spammers will abandon their attempt, a simple strategy appears: always have the first MX record fail. If the first MX record for your domain always rejects connections, the amount of spam attempts should drop, while legitimate e-mail will still go through.

So, let’s say your e-mail server is located at “mail.yourdomain.com”. Your MX records probably look like:

yourdomain.com mail exchanger = 10 mail.yourdomain.com.

We need to add an MX record that always rejects e-mail connections, like so:

yourdomain.com mail exchanger = 5 badmx.yourdomain.com.
yourdomain.com mail exchanger = 10 mail.yourdomain.com.

(Obviously in this scenario you would need to set up the IP address for “badmx.yourdomain.com” and have it point at a machine that did not have e-mail service on it.)

Now, this will reduce the amount of spam attempts, but many spammers will instead target the lowest-priority MX record (the one with the highest number). In our case, such a spammer will go directly to mail.yourdomain.com without trying badmx.yourdomain.com. To solve this issue, we add a second rejecting MX record with a higher number (that is, a lower priority) than your mail server. To revisit our previous example, the MX records would then look like:

yourdomain.com mail exchanger = 5 badmx.yourdomain.com.
yourdomain.com mail exchanger = 10 mail.yourdomain.com.
yourdomain.com mail exchanger = 20 badmx2.yourdomain.com.

This would catch a large number of spammers, and should still allow valid e-mail servers to send you e-mail.
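The effect of that three-record layout can be sketched in a few lines of Python. This is a simplified model, not a real SMTP client: the function names are hypothetical, the set of accepting hosts stands in for actual connection attempts, and the random ordering that real senders apply among records with equal numbers is not modeled.

```python
def mx_in_delivery_order(records):
    """Order (number, host) pairs the way a compliant sender tries them:
    lowest number first."""
    return [host for _, host in sorted(records)]


def compliant_sender(records, accepts_mail):
    """Try each MX host in order until one accepts the connection."""
    for host in mx_in_delivery_order(records):
        if host in accepts_mail:
            return host
    return None


def single_attempt_sender(records, accepts_mail, pick):
    """Model a spammer that tries exactly one host and then gives up.
    `pick` is "first" (lowest number) or "last" (highest number)."""
    ordered = mx_in_delivery_order(records)
    host = ordered[0] if pick == "first" else ordered[-1]
    return host if host in accepts_mail else None


records = [
    (5, "badmx.yourdomain.com"),
    (10, "mail.yourdomain.com"),
    (20, "badmx2.yourdomain.com"),
]
accepts_mail = {"mail.yourdomain.com"}  # only the real server runs e-mail service

legit = compliant_sender(records, accepts_mail)                    # falls through to the real server
spam_first = single_attempt_sender(records, accepts_mail, "first")  # hits badmx, gives up
spam_last = single_attempt_sender(records, accepts_mail, "last")    # hits badmx2, gives up
```

In this model the compliant sender always ends up at mail.yourdomain.com, while both single-attempt strategies land on a host that refuses the connection.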

For smaller domains that don’t have the ability to arbitrarily add or remove hosts but can still make changes to their DNS, Heluna offers two MX records that perform exactly this function. “reject1.heluna.com” and “reject2.heluna.com” both refuse any e-mail connections, so your list of MX records can look like so:

yourdomain.com mail exchanger = 5 reject1.heluna.com.
yourdomain.com mail exchanger = 10 mail.yourdomain.com.
yourdomain.com mail exchanger = 20 reject2.heluna.com.

Be sure to adjust the records to fit your domain; make sure that one of the Heluna servers has a lower number than your mail server (so it is tried first), and that the other Heluna server has a higher number than your mail server (so it is tried last).

Did this solution help you? Are you using the Heluna reject servers to reduce your incoming spam attempts? Let us know in the comments!

 

Java Scalability 101: Volume 1, Web Servers

September 21, 2007 - Mark

Note: this is the first in a series of articles on achieving better application response and scalability for small businesses, without having to spend an enormous amount of money on hardware.

Many application developers tend to eschew a web server for the built-in solution that comes with most popular application servers (Tomcat, JBoss, Resin, BEA, etc). While this is acceptable for a small-traffic website, this solution simply can’t hold up against a large amount of traffic, or against larger traffic spikes. When your application becomes popular overnight, and you wake up to a huge amount of web traffic overloading your application server, it can be incredibly frustrating to see your application performing so poorly while your physical hardware is so underutilized.

One famous bottleneck at the application server is simply due to HTTP keepalives. One client will, on average, tie up 2-4 threads of your application server for as long as your keepalive timeout (normally, 10-30 seconds). With the average app server able to handle 255 threads maximum, it only takes 50-100 simultaneous end users before your application grinds to a halt. Even reducing the keepalive time on your app server (if this config change is even possible) will only lessen the impact of browsers.
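The ceiling follows directly from the figures above. As a quick sketch in Python (the function name is hypothetical, and this is an upper bound that ignores every other source of load):

```python
def max_simultaneous_users(max_threads, threads_per_client):
    """Upper bound on concurrent users before every thread is held by a keepalive."""
    return max_threads // threads_per_client


best_case = max_simultaneous_users(255, 2)   # each client holds 2 threads
worst_case = max_simultaneous_users(255, 4)  # each client holds 4 threads

print(f"Thread pool exhausted at roughly {worst_case} to {best_case} users")
```

Since real requests also need threads while they are being processed, the server will typically degrade somewhat before these theoretical limits are reached.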

Enter the web server. One well-formed Apache configuration will handle thousands of simultaneous end users, and will overload your bandwidth long before overloading your memory, disk I/O, or application server threads. Simply adding the web server can make an enormous difference. However, Apache has its greatest impact when it is serving up static content for your application, leaving your application server to handle only your Java-enabled pages (jsp, struts, jsf, etc).

First, the web server compilation and initial configuration: we recommend the Apache 2.0 branch over the 1.3 or 2.2 branches, for two reasons. One, the 2.0 branch allows for a hybrid multi-threaded, multi-process model (the “worker” MPM), which is an incredible resource savings over the 1.3 model of one process per connection. One worker process can serve hundreds of simultaneous requests. Two, the 2.2 line is not yet completely compatible with all application server connectors, so 2.0 works well for us.

We build as many modules as possible as shared modules, so that any module we don’t need can simply be left unloaded, resulting in a lower memory footprint.

./configure --prefix=/usr/local/apache2 \
    --enable-mods-shared="all" --disable-ssl \
    --enable-ssl=static --enable-so \
    --with-mpm=worker --with-ssl \
    --enable-deflate=shared

This tells the configuration script to enable every module as a shared library except for SSL (which must be statically linked). It also tells Apache to use the MPM worker module.

Once the server is compiled and installed, the initial configuration: go through your httpd.conf and strip out all unnecessary LoadModule lines (to conserve memory). Also, pay attention to this section:

<IfModule worker.c>
StartServers 2
MaxClients 1000
MinSpareThreads 25
MaxSpareThreads 75
ThreadLimit 100
ThreadsPerChild 100
MaxRequestsPerChild 100000
</IfModule>

With this simple configuration, your web server will be able to handle 1000 simultaneous clients; raise this number as necessary once you run into traffic issues, and Apache will take care of scaling up on its own. Setting MaxRequestsPerChild to a non-zero value limits the impact of memory leaks, as the worker child processes are periodically killed off, returning their memory to the OS.
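When you do raise these numbers, the directives have to stay consistent with one another. The Python sketch below checks the relationships as we understand them for the worker MPM: MaxClients is the total thread count, so it is spread across MaxClients / ThreadsPerChild processes, and ThreadsPerChild may not exceed ThreadLimit. The function name is hypothetical, and the default ServerLimit of 16 is an assumption you should verify against the documentation for your Apache build.

```python
import math


def check_worker_config(max_clients, threads_per_child, thread_limit, server_limit=16):
    """Return (processes_needed, problems) for a worker MPM configuration.
    server_limit=16 is assumed to be the default when ServerLimit is not set."""
    problems = []
    if threads_per_child > thread_limit:
        problems.append("ThreadsPerChild exceeds ThreadLimit")
    if max_clients % threads_per_child != 0:
        problems.append("MaxClients is not a multiple of ThreadsPerChild")
    processes_needed = math.ceil(max_clients / threads_per_child)
    if processes_needed > server_limit:
        problems.append("MaxClients needs more processes than ServerLimit allows")
    return processes_needed, problems


# The configuration shown above: 1000 clients at 100 threads per child.
processes, problems = check_worker_config(1000, 100, 100)

# Doubling MaxClients without also raising ServerLimit.
bigger_processes, bigger_problems = check_worker_config(2000, 100, 100)
```

Under those assumptions the configuration above needs 10 child processes and passes every check, while doubling MaxClients to 2000 would also require setting ServerLimit to at least 20.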

Each application server has its own connector for Apache, usually in the form of a shared module. These also have their own configuration files. In the case of Tomcat, a simple workers.properties file will suffice, with the following (your locations may vary):

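# workers that mod_jk should start; these names are used in the JkMount lines
worker.list=local,status
workers.tomcat_home=/usr/local/tomcat
workers.java_home=/usr/java/jdk1.5.0_12
worker.local.port=8009
worker.local.host=localhost
worker.local.type=ajp13
worker.status.type=status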

Lastly, inside your VirtualHost entry for your host, add:

JkMount /*.jsp local
JkMount /status status

This will send all requests ending in *.jsp to your application server; obviously if you are running struts/etc this will need to change to your page extension or servlet path. Some developers choose to send everything to the app server except for some types; in that case, JkMount /* local and then JkUnmount /*.jpg local would work to only have jpeg files served up from Apache.

Hopefully this has shown that for application server scalability, adding a web server in front of your app server is a good first step. Static content on the filesystem is Apache’s specialty, and is where you can get the largest speed gains. Your application server threads can be saved for actually serving up dynamic content, while the large pool of Apache threads can serve up content which doesn’t change from client to client.

 

Online Virus and Spyware Checker

August 28, 2007 - Mark

Panda Software has created a site “Infected or Not” which allows you to quickly scan your PC for well-known virus and spyware infections, and also shows an interesting set of statistics of the people who used their software.

According to their counter, currently 15% of the PCs that run their software have some sort of virus or spyware embedded on their machine, with another 20% not running any antivirus software at all.  Without knowing the real numbers, it’s hard to extrapolate this statistic.  One would think that people who had suspicions that their machine was infected would visit the site, so the number may skew a little high, but even then it’s a frightening number.  From their statistics page, 3.78% of PCs that run their software have a Trojan installed on them.  That paints a very bleak picture of how many machines out there could be used as spamming tools.  (Traditionally, compromised home PCs are used by spammers to send out their spam en masse.)

Their software doesn’t work for the Mac, but then Heluna is currently unaware of any active Trojan or spyware installations on the Mac platform.  Did this tool find any virus or spyware infections on your machine?

 

Consumer Reports: $7 Billion Lost to Online Threats

August 25, 2007 - Mark

The latest Consumer Reports survey states that US consumers have lost more than $7 billion in the past two years due to viruses, spyware, and phishing e-mails.

From Consumer Reports:

Based on survey projections, computer virus infections prompted an estimated 1.8 million households to replace their computers in the past two years and 850,000 households to replace computers due to spyware infections in the past six months. Additionally, 33 percent of survey respondents did not use software to block or remove spyware. And CR projects that 3.7 million US households with broadband remain unprotected by a firewall.

If one were to assume that only 10% of those 1.8 million compromised machines were turned into spam-sending zombies, that would be 180,000 new sources of spam in the past two years.  Keep in mind that these were the households that actually decided to spend their hard-earned money to replace an otherwise good machine.  Imagine the number of families that have a compromised machine that may “run a little slow from time to time” but they keep around because they’re unwilling or unable to clean the machine or purchase a new one.  The potential for spam-sending zombies sitting inside the average household is enormous.

As for the impact of spam itself upon these families:

Consumer Reports’ survey respondents have reported a lower proportion of spam reaching their Inbox than in the past, which CR believes is a result of better spam-blocking. Survey results indicate that about 650,000 consumers ordered a product or service advertised through spam in the month before the survey. Additionally, in 5 percent of the households surveyed that had children under 18, a child had inadvertently seen pornographic material as a result of spam.

Whenever we at Heluna read these types of results, we’re always astonished at the amount of money actually spent upon products advertised by spam.  The single easiest way to slow the flood of spam is to stop purchasing items advertised in spam messages.

 

Our New Site

- Mark

You have no doubt browsed our new site, which we’re very proud of.  The new version of the site allows for much quicker rollout of the new features that we’ve been coming up with.  Clients will notice that their quarantine is much faster, and easier to manage now, while regular browsers will notice how much faster the statistics are generated.

The Heluna blog is a place for more general discussion about spam, the Internet, and other various technical items, but for now we just wanted to point out the site relaunch, and thank our clients for all of their support.  If you have any ideas for topics, please feel free to leave comments either here or in e-mail.