Today's comment spam review process:
DELETE FROM comments WHERE moderation_required = 1;
Deleted rows: 9699 (Query took 2.0489 sec)
Apologies to anyone whose comment ended up in the moderation queue. I normally keep on top of it, but after a couple of weeks of putting it off, what started as a small pile of comments to manage had ballooned into a 10,000-comment monster. Time for a comment spam prevention rethink, I reckon.
14 December 2005 | 12 comments | folksonomy, tagging, article, del.icio.us, spam, tags, directory
del.icio.us describes itself as a social bookmarking site: it allows anyone on the net to share their bookmarks with others. Unlike similar enterprises that went before it, del.icio.us does not list users' shared bookmarks only under their names. Each bookmark also has a set of "tags" associated with it. These tags are words that identify what that page is about. An article on, well, security in PHP would have the tags "php" and "security". Maybe even "webdev" and "programming" as well.
Each user can pick their own tags for their own bookmarks. You can also browse all available bookmarks by tag - meaning that if you wanted to see all bookmarks about PHP, you would simply browse to the PHP tag, and voila - there you would find all bookmarked pages about PHP.
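To make the idea of browsing by tag concrete, here is a minimal sketch in Python of a tag index. It is purely illustrative: the `build_tag_index` function, the sample URLs and the choice to lower-case tags are my own assumptions, not a description of how del.icio.us actually stores its data.

```python
from collections import defaultdict

def build_tag_index(bookmarks):
    """Map each tag to the set of URLs bookmarked with it.

    `bookmarks` is a list of (url, tags) pairs. Tags are lower-cased so
    that "PHP" and "php" land in the same bucket (an assumption made
    for this sketch).
    """
    index = defaultdict(set)
    for url, tags in bookmarks:
        for tag in tags:
            index[tag.lower()].add(url)
    return index

# Hypothetical sample data
bookmarks = [
    ("https://example.com/php-security", ["php", "security", "webdev"]),
    ("https://example.com/css-layout", ["css", "webdev"]),
    ("https://example.com/php-sessions", ["PHP", "programming"]),
]

tag_index = build_tag_index(bookmarks)

# "Browsing to the PHP tag" is then just a lookup
php_pages = sorted(tag_index["php"])
```

The same page appears under every tag it was given, which is the property that matters here: one item, many places to find it.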
What's Wrong with Directories?
Tags are an intelligent way of organising data. Regular directories work on a similar basis to a filing cabinet - items are stored within folders within drawers, and can often be found in only one place. Of course, that fails to make use of the power of databases and computers. Tags, on the other hand, allow a single item to be found in all the places it should be, rather than just the one place that is the single best fit.
This site, for example, is listed at DMOZ under "Web Design and Development: FAQs, Help, and Tutorials" - a good fit, but it also contains writing on internet marketing, browsers, usability and accessibility, and there are resources available as well as a blog. DMOZ cannot reflect this with its rigid and antiquated structure. At del.icio.us, however, it is associated with the following tags: css, php, design, web, programming, blog, webdev, blogs, development, webdesign, reference, cheatsheet, resources, code, mysql, tips, apache, web-design, tools, tutorial, tech, tutorials, computer, web-dev, html, database. And that's just the front page - specific articles are all listed with their own tags. This means that sites and pages listed by users of del.icio.us are classified and organised in a much more effective and user-friendly way.
There are more serious flaws in the directory model though. The most significant problem with web directories is the editors themselves. A web directory requires editors in order to function, and these can either be paid employees or volunteers. If your directory has paid editors working for it, you are left with no serious choice but to charge a fee for submission, in order to cover your editors' wages. That system scales pretty well - if the directory succeeds and becomes popular, the fees for the extra submissions should be enough to cover the wages of the extra editors required to process those submissions.
A volunteer system does have advantages over the paid-editor model. Because volunteers are not paid, a listing in a directory with volunteer editors can be free. This means that non-profit information sites and low-traffic sites can be listed in the directory (a fee for submission will usually prevent that), and means that editors can go out and find sites themselves to be listed.
Both of these systems have their problems. Volunteer editors are volunteers - making it much harder to hold them accountable for laziness or incompetence. DMOZ - a directory with around five to ten thousand volunteer editors - is a great example of this: submissions are often not processed for many months, if at all. Also, because it is a volunteer position, unscrupulous folk are far more likely to accept bribes to list or de-list sites - they stand to lose very little if discovered - and there are plenty of people who claim that a great many editors do just that. The system can also scale badly - if tens of thousands of submissions suddenly require processing, it can be very difficult to source the hundreds or thousands of editors needed to manage that influx.
The paid system leads to an exclusive directory, which by definition becomes one that is missing out on a huge amount of quality content. Yahoo's fees of hundreds of dollars for a submission have always seemed to many to be completely disproportionate to the benefit of a listing and for many people are higher than the cost of hosting a site or the income from it. As a result, for a long time (and this is still true to a great extent) Yahoo has been an incomplete directory, lacking the in-depth listings required by today's discerning web surfer.
Why is del.icio.us better?
del.icio.us is different to both of these systems. It is similar to a peer-review system, in fact. One person bookmarking one page can count as a vote for that page. As the user will have added tags as well, their vote tells the system that one person believes that the page in question is related to each of the tags they listed. After a few hundred people have bookmarked the same link, you'll begin to see some tags used more than others, giving you an idea of how closely the target page relates to each of those tags.
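The voting idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the `tag_votes` and `tag_relevance` functions, the sample users, and the "share of bookmarkers who used the tag" measure are all hypothetical, and del.icio.us may well weigh things differently.

```python
from collections import Counter

def tag_votes(bookmarks_for_url):
    """Count how many users applied each tag to one URL.

    `bookmarks_for_url` is a list of (user, tags) pairs for a single
    page. Each user counts at most once per tag.
    """
    votes = Counter()
    for _user, tags in bookmarks_for_url:
        votes.update({tag.lower() for tag in tags})
    return votes

def tag_relevance(votes, total_users):
    """Share of bookmarking users who chose each tag (0.0 to 1.0)."""
    return {tag: count / total_users for tag, count in votes.items()}

# Hypothetical bookmarks of one article on PHP security
bookmarks_for_url = [
    ("alice", ["php", "security"]),
    ("bob", ["PHP", "security", "webdev"]),
    ("carol", ["php"]),
    ("dave", ["php", "security"]),
]

votes = tag_votes(bookmarks_for_url)
relevance = tag_relevance(votes, total_users=len(bookmarks_for_url))
```

With this sample data every user chose "php", three of the four chose "security" and only one chose "webdev", so the tallies give a rough ranking of how closely the page relates to each tag.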
What this has created is a kind of directory with a distributed editing system. The editors are volunteers, but because of their sheer numbers it is much harder for any one editor to affect listings. If one editor is lazy, it does not matter - there are thousands more covering the same topic. If an editor makes a mistake, and lists a page or site under the wrong tag, it doesn't matter - the huge numbers of other editors will make up for it.
Spam is likely to become a huge problem for del.icio.us. To a degree, there is already spam within the index, and some work has already been put into preventing it. del.icio.us's robots.txt file prevents indexing of the whole site, so there is no link popularity benefit from spamming the index directly. However, del.icio.us can still generate plenty of traffic, and the RSS feeds it produces can generate link popularity from other sites.
Luckily, there are plenty of signals they could look for to weed out spam. Their database can already tell them which tags are related, and which sites. If a user starts to list unrelated sites with tags unrelated to those sites, they may well be a spammer. If lots of new users suddenly join and all bookmark the same page instantly, using the same tags, again that may well be spam. IP tracking and the registration system (which requires a valid email and features a Turing test) should make automated spam far harder. Ultimately, it may be del.icio.us's own success that makes spamming virtually impossible. With enough users on the site, a spammer may need to create hundreds, even thousands, of fake users to have a site listed in the "popular" section, or listed highly for a specific tag.
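As a rough illustration of the "lots of new users bookmarking the same page with the same tags" signal, here is a hedged Python sketch. Everything in it is an assumption on my part: the `suspicious_bursts` function, the event fields, and the thresholds (accounts under an hour old, three or more of them within ten minutes) are invented for the example and are not anything del.icio.us is known to use.

```python
from collections import defaultdict

def suspicious_bursts(events, account_age_limit=3600, window=600, min_accounts=3):
    """Return URLs bookmarked by a burst of brand-new accounts.

    `events` is a list of dicts with the keys user, url, tags,
    registered_at and bookmarked_at (times in seconds). A URL is
    flagged when at least `min_accounts` distinct accounts, each
    younger than `account_age_limit`, bookmark it with an identical
    tag set within `window` seconds of each other.
    """
    groups = defaultdict(list)
    for event in events:
        account_age = event["bookmarked_at"] - event["registered_at"]
        if account_age <= account_age_limit:
            key = (event["url"], frozenset(t.lower() for t in event["tags"]))
            groups[key].append(event)

    flagged = set()
    for (url, _tags), group in groups.items():
        for start in group:
            # Distinct new accounts bookmarking within the window after `start`
            users = {
                e["user"] for e in group
                if 0 <= e["bookmarked_at"] - start["bookmarked_at"] <= window
            }
            if len(users) >= min_accounts:
                flagged.add(url)
                break
    return sorted(flagged)

# Hypothetical events: three fresh accounts pushing one page, plus normal use
events = [
    {"user": "spam1", "url": "http://spam.example/pills", "tags": ["php", "security"],
     "registered_at": 1000, "bookmarked_at": 1060},
    {"user": "spam2", "url": "http://spam.example/pills", "tags": ["php", "security"],
     "registered_at": 1010, "bookmarked_at": 1100},
    {"user": "spam3", "url": "http://spam.example/pills", "tags": ["security", "PHP"],
     "registered_at": 1020, "bookmarked_at": 1150},
    {"user": "alice", "url": "https://example.com/php-security", "tags": ["php", "security"],
     "registered_at": 0, "bookmarked_at": 100000},
    {"user": "carol", "url": "https://example.com/php-security", "tags": ["php"],
     "registered_at": 900, "bookmarked_at": 1000},
]

flagged = suspicious_bursts(events)
```

A real system would want to combine a signal like this with the others mentioned above (IP addresses, tag relatedness) rather than act on it alone, since a genuinely popular new page could trigger it too.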
The intention of del.icio.us is not (at the moment) to become or create a directory. It is by pure fluke that they have created a site and a system so able to perform the same function as a directory but without the problems associated with one. It may well be that other similar sites will spring up whose aim is to build a directory, especially from those involved in search, with a large user base. I would be greatly surprised if Google, MSN and Yahoo were not already watching del.icio.us very closely and with great interest. I would be equally surprised if none of them bought or created a social bookmarking product in the next few months, as the power of distributed editing becomes apparent.
Update (14th December 2005)
It appears that I was rather close to the mark with my guess at what would happen next for del.icio.us - they have just been bought by Yahoo. It will be interesting to see how Yahoo integrate del.icio.us with their other services. I only hope they don't mess it up like so many promising sites before!
14 September 2005 | 43 comments | howto, webdev, server, spam, referrer, wordpress, htaccess, admin, apache
Referrer spam is becoming increasingly common. At best, it will only render your log files useless. At worst, it can cause your site to be dropped by search engines and your running costs to skyrocket. Here's how to block spurious referrers.