Comment spam, like referrer spam, is on the rise. As with referrer spam blocking, the problem we face is how to reduce the spam, and the cost in time spent dealing with it, without affecting genuine users. Ideally, a tool would exist that could identify spam of all types and remove it invisibly without ever flagging something genuine as spam. It's unlikely to ever exist - Bayesian spam filters have come about as close as we can hope to that ideal - so we need to look at alternative methods of dealing with comment spam.
Plenty of potential solutions have been touted recently for the comment spam problem. Some people seem to think that URL blacklisting is the way forward, some that email authentication will stop spammers, some that reporting them to their ISPs will work, and some that having spammers identify text within an image is a good way forward.
The simple fact is that all of these have been tried before, and almost every one of them has either failed or caused such serious problems in other ways that it is not practically useful. We should be looking to areas where spam has been a problem for a while and learning from the solutions tried there, not repeating things we've already seen fail.
Blacklists are basically large lists of URLs that are known to be promoted through spam. Unfortunately, spam blacklists really do not work very well. The idea is a good one, but there are major flaws in a system like this once you reach any significant size.
Genuine URLs can be listed on a blacklist (my site was identified by a blacklist filter a few weeks ago, because it partly matched a blocked URL), and that will irritate genuine commenters. It also requires a huge commitment in time to maintain a spam blacklist. Blacklists are ok in the beginning, but they cannot work forever.
* I am aware that my referrer spam blocking here uses a basic form of blacklist. However, as I have mentioned, this is not, and cannot be, a long term solution.
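To illustrate how a partial match can catch a genuine site, here is a minimal Python sketch. The blacklist entry, the URLs and the function names are all invented for the example, and it makes no claim about how any particular blacklist filter actually works; it only contrasts naive substring matching with matching on the host name.

```python
from urllib.parse import urlparse

# Hypothetical blacklist entry, invented for this example.
BLACKLIST = {"cheap-pills.example"}


def naive_match(url: str) -> bool:
    """Substring matching: flags any URL that merely contains a listed string."""
    return any(entry in url for entry in BLACKLIST)


def host_match(url: str) -> bool:
    """Flags a URL only if its host is a listed domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == entry or host.endswith("." + entry) for entry in BLACKLIST)


genuine = "http://not-cheap-pills.example/blog"   # an innocent, similar-looking site
spammy = "http://www.cheap-pills.example/buy"

print(naive_match(genuine), host_match(genuine))
print(naive_match(spammy), host_match(spammy))
```

Stricter matching cuts down on this kind of false positive, but it does nothing about the maintenance burden: someone still has to keep the list up to date.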
Forcing users to enter a valid email address, checked by posting a validation code to that address, is a good way to prevent comment spam. Most spammers do not have the patience to wait for an email to arrive, when they can easily move on to the next site.
The problem with this tactic is that the same applies to regular users. Users will not want to have to register to comment, unless they can register for all the sites they use at once (TypeKey provides just this kind of service, as does Drupal for its users). I have registered once or twice when I've wanted to leave a comment on an article or post, but there have been tens of other sites where I just haven't bothered, because I wasn't so desperate to comment that I was going to give away my email address (and risk it being sold) and wait for a validation email to arrive.
A system like TypeKey has the potential to help in the fight against comment spam, because it allows the investment of time required by a user to comment to be reduced overall - one registration allows for commenting at many sites. However, until many more sites use this specific system, registration is still going to be an annoyance to many for small reward. Not to mention the problems the people running the system are going to have with keeping spammers out - if a user can register for many sites at once, so can a spammer.
Turing Tests (CAPTCHAs)
Turing tests are designed to tell a person from a machine, and are pretty simple. They could be a question ("What is the capital of England?") or, more commonly, they could be a CAPTCHA, an image containing a code that must be entered by the user to proceed. The image is usually blurred and distorted so that a user can read the code but a machine would have trouble.
The problem with Turing tests like these is that however good they are, they always introduce another step to the commenting process, which will make commenting that little bit less attractive to a user. Image-based tests also present a major accessibility problem, since a user who cannot see the image cannot read the code, which makes them a no-goer for any conscientious designer. This method is extremely effective for reducing automated spam, but will reduce authentic comments too.
Reporting spammers to their ISPs is something many bloggers seem to have started to do. This can be worth doing, if it makes you feel better. Spam is usually against an ISP's acceptable use policy (AUP), and so reporting someone will often result in them receiving a warning or being booted off their ISP. Unfortunately, most automated blog spam comes from innocent third parties whose machines have been infected, making them part of a larger zombie network, so a report is extremely unlikely to actually reduce spam. All it will do is possibly irritate a spammer.
Bayesian Spam Filtering
Bayesian spam filtering works wonders with email. Since first using Bayesian filters many moons ago, I have received virtually no spam. In the last year, I have had 97% of spam I have received filtered out before it reached me. I have had 12 false positives (emails identified as spam when they are not) in that time. I can live with numbers like that.
However, though Bayesian filtering is effective for emails, it is not useful for comment spam. The main principle behind Bayesian filters is that for email spam to work, it must include a sales pitch, and the names of the things being sold. Whatever email spammers do, they must keep the names of products in their emails, and a sales pitch. Bayesian filtering can easily and very accurately spot a sales pitch, and thus filter mail accordingly.
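For readers who haven't seen one, the following is a toy sketch of the word-counting idea behind a Bayesian filter, written in Python. It is a plain naive Bayes classifier with Laplace smoothing, not the algorithm of any particular mail filter, and the class name, the tokeniser and the four training messages are all made up for illustration. It assumes at least one message of each kind has been trained before scoring.

```python
import math
import re
from collections import Counter


def tokens(text: str) -> list:
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class ToyBayesFilter:
    """Hypothetical minimal filter; train it on both labels before scoring."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text: str, label: str) -> None:
        self.word_counts[label].update(tokens(text))
        self.message_counts[label] += 1

    def spam_probability(self, text: str) -> float:
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_messages = sum(self.message_counts.values())
        log_scores = {}
        for label in ("spam", "ham"):
            total_words = sum(self.word_counts[label].values())
            # Start from the prior, then add each word's smoothed log-likelihood.
            score = math.log(self.message_counts[label] / total_messages)
            for word in tokens(text):
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocab)))
            log_scores[label] = score
        # Normalise the two log scores into a probability of spam.
        top = max(log_scores.values())
        spam = math.exp(log_scores["spam"] - top)
        ham = math.exp(log_scores["ham"] - top)
        return spam / (spam + ham)


bayes = ToyBayesFilter()
bayes.train("buy cheap pills now", "spam")
bayes.train("cheap pills discount offer", "spam")
bayes.train("great article thanks for writing", "ham")
bayes.train("I disagree with your point about blacklists", "ham")
print(bayes.spam_probability("cheap pills offer") > 0.5)
```

The filter leans entirely on the words a message uses, which is why the vocabulary of a sales pitch is so easy for it to pick up.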
Comment spam does not have the same goal as email spam. Comment spam is designed to improve the rankings of pages in search engines, by increasing those pages' link popularity. No sales pitch is needed, and though the product name is often mentioned, it is not needed either. Thus, Bayesian filters are unfortunately likely to be extremely ineffectual against comment spam.
So what can we do?
There are two types of comment spam, and the first thing to do is to differentiate between the two, as each needs to be treated and combatted differently. First is automated spam - comments by programs (or bots) that trawl the web looking for comment forms and fill them out, often with the help of "zombie" machines, computers that have been infected with a virus of some sort. The second type is manual spam - real people trawling the web and posting spam themselves - and this is often more difficult to beat.
Automated spam is easy to stop. A good start is to change the names of your form fields - calling a text input box "url" or "website" is asking for trouble. Ideally, you'd have the names change themselves regularly, perhaps even each time the page is loaded - not too tricky with any web programming language. This will prevent an automated program from spotting which item should go into which input box, and thus comment submissions will usually fail.
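One way to have the field names change on every page load is sketched below in Python. This is only an outline under a few assumptions: the secret key, the list of logical field names, and the functions field_name, new_form and decode_submission are all hypothetical, and the per-page token is assumed to travel with the form in a hidden field.

```python
import hashlib
import hmac
import secrets

SECRET_KEY = b"change-me"  # assumption: a secret known only to the server
LOGICAL_FIELDS = ("author", "email", "url", "comment")


def field_name(token: str, logical: str) -> str:
    """Derive an opaque field name from the per-page token and the real name."""
    message = f"{token}:{logical}".encode()
    digest = hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()
    return "f" + digest[:12]


def new_form() -> tuple:
    """Return a fresh token and the obscured names to use when rendering the form."""
    token = secrets.token_hex(8)
    return token, {name: field_name(token, name) for name in LOGICAL_FIELDS}


def decode_submission(token: str, posted: dict) -> dict:
    """Map obscured names back to real ones; anything else is ignored."""
    reverse = {field_name(token, name): name for name in LOGICAL_FIELDS}
    return {reverse[key]: value for key, value in posted.items() if key in reverse}


token, names = new_form()
posted = {names["comment"]: "Nice post", "url": "http://spam.example"}
print(decode_submission(token, posted))
```

A bot that blindly fills in a field called "url" gets nowhere, because that name no longer maps to anything. A real implementation would probably also want to expire old tokens, so that a form fetched once cannot be replayed forever.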
Second, make sure your comment entry form is on its own page, separate to your posts. An automated comment form detection tool will invariably grab the pages returned from a Google search for a specific query. It will then check those pages for a comment form, and fill it out (or note the URL) if it finds it. If your comment form is on its own page, it will likely not be found by an automated tool, as your posts will almost always outrank your comments pages in search engine results and often a bot like this will not follow links on a page it has retrieved.
Third, change the name of your scripts. Movable Type, one of the most common blogging tools out there, names its comment entry form "mt-comments.cgi". Spammers can use this script name in a Google search to identify Movable Type blogs specifically, and it is the same with almost all other blogging software. Change the name of this script, and you will make it a little harder for spammers to find your blog and spam it.
Last, log all IP addresses and user agents with comments. Spamming tools will often be identifiable through one or both of these, usually the IP address. Block IP addresses of persistent spammers if you wish, but be careful - sometimes many users of an ISP will share an IP address. AOL, especially, is well known for this. Sometimes the person who appears to be posting the spam is not - spammers use viruses to infect machines and have them post comments. You should not ban IP addresses unless you are fairly sure it is just the spammer who will suffer.
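The logging itself can be very simple. Here is a rough Python sketch, assuming an in-memory log and a hypothetical CommentLog class; a real site would store this in its database. In keeping with the caution above, it only lists addresses for a human to review and never bans anything on its own.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class CommentLog:
    """Hypothetical in-memory log of who posted each comment."""
    entries: list = field(default_factory=list)

    def record(self, ip: str, user_agent: str, is_spam: bool = False) -> None:
        """Store the IP address and user agent alongside the time of posting."""
        self.entries.append({
            "time": datetime.now(timezone.utc),
            "ip": ip,
            "user_agent": user_agent,
            "is_spam": is_spam,
        })

    def review_candidates(self, threshold: int = 3) -> list:
        """IPs with at least `threshold` spam comments, for a person to look at."""
        spam_by_ip = Counter(e["ip"] for e in self.entries if e["is_spam"])
        return sorted(ip for ip, count in spam_by_ip.items() if count >= threshold)


log = CommentLog()
for _ in range(3):
    log.record("203.0.113.7", "ExampleBot/1.0", is_spam=True)
log.record("198.51.100.2", "Mozilla/5.0", is_spam=False)
print(log.review_candidates())
```

The threshold of three is an arbitrary choice for the example. Whether an address that crosses it should actually be blocked is still a judgement call, for the shared-IP reasons given above.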
Manual spam is more tricky. Generally, the spammer will arrive from a search engine, searching for a specific keyword or set of keywords. Their IP address is a weakness, but sometimes this can't be used for blocking, and it requires that they post at least one spam comment before they can be blocked. Their user agent will be a normal browser, so you can't block them that way either. All is not lost, though.
First, remember that a spammer will invariably not read an article or post they are commenting on. Why would they? It might be a great article, but they are aiming to generate as many links as possible in the shortest time possible, not expand their minds. This is probably the spammer's most serious weakness.
The first, and easiest, way to reduce manual comment spam is to make use of the time a visitor spends on the page. This is easiest if your comment entry form is on a separate page to your posts. Using the programming language of your choice, note the exact time a user lands on an article. When the user then loads the comment page (or posts a comment), see how much time has passed. In the case of spammers, this will often be less than 10 to 20 seconds - regular people will take the time to read an article, and will usually take at least a minute or two to move on.
If a user is too fast, you could block access to the page. You could place their comment in a queue for moderation, or flag it. You could just remove the URL from the comment. I would suggest that you don't actually say "You posted your comment too quickly", because then the spammer knows how to work around your filtering - they'll just load your posts in the background and come back to comment later, when enough time has elapsed.
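The timing check could be sketched roughly like this in Python. The 20 second threshold, the in-memory dictionary and the function names are assumptions made for the example; on a real site the landing time would more likely live in the session or in a signed hidden field.

```python
import time

MIN_READ_SECONDS = 20  # assumption: taken from the top of the 10 to 20 second range

# Hypothetical store of landing times, keyed by (session id, post id).
landing_times = {}


def note_landing(session_id: str, post_id: str, now: float = None) -> None:
    """Remember when this visitor first loaded the article."""
    landing_times[(session_id, post_id)] = time.time() if now is None else now


def triage_comment(session_id: str, post_id: str, now: float = None) -> str:
    """Return 'publish' for unhurried commenters and 'moderate' for everyone else."""
    now = time.time() if now is None else now
    landed = landing_times.get((session_id, post_id))
    # No recorded visit, or a very quick one, sends the comment to the queue.
    if landed is None or now - landed < MIN_READ_SECONDS:
        return "moderate"
    return "publish"


note_landing("session-1", "post-42", now=1000.0)
print(triage_comment("session-1", "post-42", now=1005.0))
print(triage_comment("session-1", "post-42", now=1120.0))
```

Quietly queueing the comment, rather than rejecting it with an explanation, keeps the spammer from learning what the threshold is.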
Another effective way to reduce comment spam is to use moderation. You can require that all comments are approved before they are actually shown on the site. This method is extremely effective, but very time consuming. Most blogging software now includes this option. If you have the time to manage this kind of system, I'd recommend doing so. If you get too many comments for this to be practical, you'll need a more automated solution.
You could do what many sites do, and turn off the ability to comment on articles after a certain period of time. As with many spam prevention techniques, this method prevents your genuine users doing something too. It is an effective way to reduce comment spam, because often by the time a page is indexed by a search engine and spammers can find it, commenting is impossible. It is, however, also an effective way to prevent often-valuable comments being made on older articles or posts. The majority of comments on this site have been made well after the post itself was made, and I want to keep those comments coming.
This can be useful on sites that receive a lot of comments. Rather than checking comments on hundreds, or thousands, of posts, you only need to keep an eye on the newer items on your site, which massively reduces the time you'll spend on spam prevention.
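Closing comments on older posts comes down to a single date comparison. A minimal Python sketch follows; the 30 day window and the function name are assumptions for the example, not a recommendation.

```python
from datetime import datetime, timedelta, timezone

COMMENT_WINDOW = timedelta(days=30)  # assumption: pick whatever suits your site


def comments_open(published_at: datetime, now: datetime = None,
                  window: timedelta = COMMENT_WINDOW) -> bool:
    """True while the post is still young enough to accept comments."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - published_at <= window


published = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(comments_open(published, now=datetime(2024, 1, 15, tzinfo=timezone.utc)))
print(comments_open(published, now=datetime(2024, 3, 1, tzinfo=timezone.utc)))
```

The check belongs in the code that accepts the submission as well as the code that displays the form, since a bot can post to the script directly.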
You can also remove the value in spam very easily by simply allowing people to comment but without allowing them to post their URL when they do so.
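As a rough illustration, the Python sketch below drops the URL field from a submitted comment and also replaces anything that looks like a link in the comment text. The dictionary keys ("url", "body") and the pattern are assumptions for the example, and the pattern is deliberately crude; it will not catch every way of writing a link.

```python
import re

# Crude pattern: anything starting with http://, https:// or www.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def strip_links(comment: dict) -> dict:
    """Return a copy of the comment with no URL field and no links in the body."""
    cleaned = {key: value for key, value in comment.items() if key != "url"}
    cleaned["body"] = URL_PATTERN.sub("[link removed]", cleaned.get("body", ""))
    return cleaned


submitted = {
    "author": "A. Visitor",
    "url": "http://spam.example",
    "body": "Nice post! Visit http://spam.example/pills now",
}
print(strip_links(submitted))
```

Genuine commenters lose the link back to their own site as well, which is the price of this approach.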
In the end, the best thing you can do is work out for yourself how much time you can dedicate to comment spam prevention. If you have time to spare, and don't get massive numbers of comments, the best way to combat spam is to be diligent and delete spam yourself when it turns up. If you're only receiving a couple of comments a day, you'd be wise not to place obstacles in the way of your regular users commenting. You want to encourage genuine comments, remember.
The single most effective way to completely stop comment spam is to remove all incentives for spammers to add comments. If spam has become a serious enough problem, turn off comments altogether or turn off the ability to post URLs with comments.