The recent RockYou.com password problems have spawned plenty of debate online about the best way to store passwords and build a site securely.
Part of being a good, security-conscious web developer is paranoia, and it's apparent that the RockYou.com developers could have used a little more of it. They made two mistakes, not one. The first, and most obvious, was that they had a SQL injection hole somewhere. The second was assuming that the measures they had taken to protect their data were enough.
A healthy dose of paranoia would have led their developers to make the opposite assumption - that whatever they did to protect the data, sooner or later someone would be able to access it.
The result of this second mistake is that, rather than simply announcing that a security hole has been found and closed, they have had to deal with the fact that the passwords of more than 32 million people have been exposed, in plain text, to an unknown number of people. As most people use the same password in multiple places, and most will be unaware that this has happened, we can safely assume that the access details of millions of email accounts are in the open and unchanged. That's a bad day in code-land by anyone's standards.
The solution to the problem is to first assume that all data will be exposed at some point to an intruder of some sort. Once you assume that, it becomes important to ensure that the damage resulting from that exposure is minimal.
Which brings me on to hashes. Hashes are one-way functions that generate a representation, usually a number, of the data put into them. They always generate the same hash from the same data, and there is no simple way to reverse the process.
This makes them incredibly useful for password storage. Instead of storing a user's password, you can store the hash of the password. When a user logs in again, instead of checking the password they type in against the one you have stored, you calculate the hash of the password they type in and compare that to the stored hash.
There are lots of different hashing algorithms, the most commonly used being MD5 and SHA1.
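As a minimal sketch of that store-and-compare flow, here is how it might look in Python using the standard `hashlib` module and SHA1. The function names `hash_password` and `check_login` and the sample passwords are my own, made up for illustration. Bear in mind that a plain hash like this is only the starting point, for the reasons that follow.

```python
import hashlib


def hash_password(password: str) -> str:
    """Return the SHA1 hash of a password as a 40-character hex string."""
    return hashlib.sha1(password.encode("utf-8")).hexdigest()


def check_login(typed_password: str, stored_hash: str) -> bool:
    """Hash what the user typed and compare it to the stored hash."""
    return hash_password(typed_password) == stored_hash


# At sign-up: store only the hash, never the password itself.
stored_hash = hash_password("correct horse")

# At login: compare hashes, not passwords.
login_ok = check_login("correct horse", stored_hash)
login_bad = check_login("wrong horse", stored_hash)
```

The same input always produces the same hash, which is what makes the comparison at login work without the site ever keeping the original password.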
Unfortunately, ensuring passwords are stored securely isn't as simple as just storing a simple hash of a password. Two of the strengths of hashes are also their largest potential weaknesses: they are small to store and quick to generate.
To generate SHA1 and MD5 hashes of every word in English, for example, takes moments. To store that amount of data is also trivial. To generate hashes of all combinations of letters and numbers, plus a few commonly used punctuation marks, up to say 8 characters, is much slower but still doable without any special setup or equipment.
Tables of precalculated hashes of data like this are easily found online or easily generated. If you have a hash of some data (like a password) and you want to see what that data originally was, you can compare the hash to the entries in your precalculated table. If you find a match, you have discovered the data that was originally used to generate the hash - the password you were trying to find out.
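To make that lookup concrete, here is a rough sketch in Python. The tiny `wordlist` stands in for a real dictionary, and `build_table` and `dehash` are names I have invented for the example. A real table would hold millions of entries, but the principle is the same.

```python
import hashlib


def build_table(words):
    """Precalculate a mapping of MD5 hash -> original word."""
    return {hashlib.md5(w.encode("utf-8")).hexdigest(): w for w in words}


def dehash(stolen_hash, table):
    """Look a hash up in the table; returns None if it is not there."""
    return table.get(stolen_hash)


# A stand-in for a dictionary of common passwords (illustrative only).
wordlist = ["password", "123456", "letmein", "qwerty"]
table = build_table(wordlist)

# Pretend this hash was read out of an exposed user table.
stolen = hashlib.md5(b"letmein").hexdigest()
recovered = dehash(stolen, table)

# A password that is not in the table cannot be recovered this way.
missed = dehash(hashlib.md5(b"Xq7!vLp2#r").hexdigest(), table)
```

The table is built once and can then be reused against every hash in the exposed data, which is what makes this approach so cheap for an attacker.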
So basic password hashing is, essentially, useless for the majority of users. It is a simple process to compare hashes of basic passwords to a table of precalculated hashes and thereby "dehash" passwords en masse.
Some people recommend nesting hashes (taking the hash of a hash) as a way to add complexity and therefore more security. Unfortunately, generating tables of nested hashes is almost as easy as generating tables of plain hashes, so nesting by itself is no more secure.
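A quick sketch shows why. Assuming the site stores the SHA1 of the MD5 of the password (a scheme I have picked purely for illustration, along with the `nested_hash` name), an attacker who knows or guesses the scheme builds the table in exactly the same way, with one extra function call per word.

```python
import hashlib


def nested_hash(password: str) -> str:
    """SHA1 of the MD5 hex digest of the password (illustrative scheme)."""
    inner = hashlib.md5(password.encode("utf-8")).hexdigest()
    return hashlib.sha1(inner.encode("utf-8")).hexdigest()


# The attacker's table costs one extra hash per word, nothing more.
wordlist = ["password", "123456", "letmein", "qwerty"]
nested_table = {nested_hash(w): w for w in wordlist}

# A nested hash taken from exposed data is looked up just as before.
stolen = nested_hash("qwerty")
recovered = nested_table.get(stolen)
```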
The solution is to hash more than just the user's password, and this process is called "salting". For example, instead of storing a hash of a user's password, you could store the hash of their email address and their password together.
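Here is a minimal sketch of that idea in Python, assuming the email address is simply prepended to the password before hashing. The `salted_hash` function, the lowercasing of the email and the example addresses are all my own assumptions, not a prescribed scheme.

```python
import hashlib


def salted_hash(email: str, password: str) -> str:
    """Hash the email and password together (email acts as the salt)."""
    # Lowercasing the email is an assumption, so that "Bob@..." and
    # "bob@..." produce the same hash at login.
    data = (email.strip().lower() + password).encode("utf-8")
    return hashlib.sha1(data).hexdigest()


def check_login(email: str, typed_password: str, stored_hash: str) -> bool:
    """Recalculate the salted hash at login and compare to the stored one."""
    return salted_hash(email, typed_password) == stored_hash


# Two users with the same password end up with different stored hashes.
alice_hash = salted_hash("alice@example.com", "letmein")
bob_hash = salted_hash("bob@example.com", "letmein")

# A table of plain password hashes no longer matches either of them.
plain_hash = hashlib.sha1(b"letmein").hexdigest()

alice_ok = check_login("Alice@example.com", "letmein", alice_hash)
```

One consequence of this particular choice of salt is that the stored hash has to be recalculated whenever a user changes their email address, which means asking for their password at that point.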
This is effective because tables of hashes of generated data of more than about 10 characters start to become problematic to generate and store. At around that point, tables must be generated based upon dictionaries and known words, rather than on programmatically generated lists of all possible passwords in a range.
The average length of "email plus password" is easily in the region of 25 characters. Not only that, but if someone worked out that you were using hashes of "email plus password", they would still need to generate a new table for every password they wanted to dehash, because every user's email address is different.
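To get a feel for how quickly the numbers grow, here is a back-of-the-envelope calculation in Python. The alphabet size of 70 (upper and lower case letters, digits and eight punctuation marks) is my assumption, and treating "email plus password" as 25 random characters overstates the real figure, since emails are far from random. It is only meant to show the scale.

```python
def keyspace(alphabet_size: int, max_length: int) -> int:
    """Count every string of length 1..max_length over the alphabet."""
    return sum(alphabet_size ** n for n in range(1, max_length + 1))


# Assumed alphabet: 26 lower + 26 upper + 10 digits + 8 punctuation marks.
ALPHABET_SIZE = 70

for max_length in (8, 10, 25):
    count = keyspace(ALPHABET_SIZE, max_length)
    print(f"up to {max_length:>2} characters: {count:.2e} candidates")
```

Each extra character multiplies the size of the table by the size of the alphabet, so going from 8 to 10 characters already makes the table thousands of times larger.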
This level of complexity, added to a reasonably strong password policy, ensures that if (or when) your user data is exposed, the work involved in extracting usable passwords from it is going to stop all but the most determined attackers. Not only that, but even they will find extraction of data in bulk prohibitively difficult.