Tagged with "programming" - Web Development in Brighton - Added Bytes
http://www.addedbytes.com/feeds/tag-feed/

Why You Should Always Salt Your Hashes
http://www.addedbytes.com/blog/why-you-should-always-salt-your-hashes/

The Problem

The recent RockYou.com password problems have spawned plenty of debate online about the best way to store passwords and build a site securely.

Part of being a good, security-conscious web developer is paranoia, and it's apparent that the RockYou.com developers could have used a little more of it. They made two mistakes in their work, not one. Their first, and most obvious one, is that they had a SQL injection hole somewhere. Their second was their assumption that their measures to protect their data were enough to do so.

A healthy dose of paranoia would have led their developers to make the opposite assumption - that whatever they did to protect the data, sooner or later someone would be able to access it.

The result of this second mistake is that, rather than simply announcing a security hole has been found and closed, they have had to deal with the fact that the passwords of more than 32 million people have been exposed, in plain text, to an unknown number of people. As most people use the same password for multiple places, and most will be unaware that this has happened, we can safely assume that the access details of millions of email accounts are in the open and unchanged. That's a bad day in code-land by anyone's standards.

Hashing

The solution to the problem is to first assume that all data will be exposed at some point to an intruder of some sort. Once you assume that, it becomes important to ensure that the damage resulting from that exposure is minimal.

Which brings me on to hashes. Hashes are one-way functions that generate a representation, usually a number, of the data put in to them. They always generate the same hash from the same data, and there is no simple way to reverse the process.

This makes them incredibly useful for password storage. Instead of storing a user's password, you can store the hash of the password. When a user logs in again, instead of checking the password they type in against the one you have stored, you calculate the hash of the password they type in and compare that to the stored hash.

There are lots of different hashing algorithms, the most commonly used being MD5 and SHA1.

Are Hashes Secure?

Unfortunately, ensuring passwords are stored securely isn't as simple as just storing a plain hash of a password. Two of the strengths of hashes are also their largest potential weakness: they are small to store and quick to generate.

To generate SHA1 and MD5 hashes of every word in English, for example, takes moments. To store that amount of data is also trivial. To generate hashes of all combinations of letters and numbers, plus a few commonly used punctuation marks, up to say 8 characters, is much slower but still doable without any special setup or equipment.

Tables of precalculated hashes of data like this are easily found online or easily generated. If you have a hash of some data (like a password) and you want to see what that data originally was, you can compare the hash to the entries in your precalculated table. If you find a match, you have discovered the data that was originally used to generate the hash - the password you were trying to find out.

So basic password hashing is, essentially, useless for the majority of users. It is a simple process to compare hashes of basic passwords to a table of precalculated hashes and thereby "dehash" passwords en masse.
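To make the lookup concrete, here is a minimal sketch of a precalculated table and a "dehash" against it. The function names and the four-word list are made up for illustration; real tables hold millions of entries, but the principle is the same.

```php
<?php
// Illustrative only: a tiny "precalculated table" mapping MD5 hashes back to words.
// build_lookup_table() and dehash() are hypothetical helper names.
function build_lookup_table(array $wordlist): array
{
    $table = [];
    foreach ($wordlist as $word) {
        $table[md5($word)] = $word; // hash => original word
    }
    return $table;
}

function dehash(string $hash, array $table): ?string
{
    // A hit means the original password has been recovered.
    return $table[$hash] ?? null;
}

$table = build_lookup_table(['password', 'qwerty', 'letmein', '123456']);
$stolenHash = md5('letmein');           // what an unsalted database would store
$recovered = dehash($stolenHash, $table);
```

The table is built once and can then be reused against every unsalted hash in a stolen database, which is what makes the attack so cheap.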

Some people recommend nesting hashes as a way to add complexity and therefore security. Unfortunately, generating tables of nested hashes is almost as easy as generating tables of plain hashes, so nesting is no more secure.

Add Salt!

The solution is to hash more than just the user's password, and this process is called "salting". For example, instead of storing a hash of a user's password, you could store the hash of their email address and their password together.

This is effective because tables of hashes of generated data of more than about 10 characters start to become problematic to generate and store. At around that point, tables must be generated based upon dictionaries and known words, rather than on programmatically generated lists of all possible passwords in a range.

The average length of "email plus password" is easily in the region of 25 characters. Not only that, but if someone worked out that you were using hashes of "email plus password", they would still need to generate a new table for every password they wanted to dehash.

This level of complexity, added to a reasonably strong password policy, ensures that if (or when) your user data is exposed, the work involved in extracting usable passwords from it is going to stop all but the most determined attackers. Not only that, but even they will find extraction of data in bulk prohibitively difficult.
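A minimal sketch of the "email plus password" approach might look like the following. The helper names are hypothetical, and SHA-256 is my choice for the example rather than anything prescribed above. Note that the stored hash has to be recalculated whenever a user changes their email address.

```php
<?php
// Sketch of salting a password hash with the user's email address.
// hash_password() and verify_password() are hypothetical helper names.
function hash_password(string $email, string $password): string
{
    // Normalise the email so "A@x.com" and "a@x.com" produce the same salt.
    $salt = strtolower(trim($email));
    return hash('sha256', $salt . ':' . $password);
}

function verify_password(string $email, string $password, string $storedHash): bool
{
    // hash_equals() compares in constant time, avoiding timing leaks.
    return hash_equals($storedHash, hash_password($email, $password));
}

$stored = hash_password('Jack@Example.com', 's3cret');
```

On PHP 5.5 and later, the built-in password_hash() and password_verify() functions generate a random salt per password and use a deliberately slow algorithm, so they are usually a better starting point than a hand-rolled scheme like this one.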



Wed, 16 Dec 2009 10:24:32 +0000 http://www.addedbytes.com/blog/why-you-should-always-salt-your-hashes/ Dave Child

Writing Secure PHP, Part 4
http://www.addedbytes.com/articles/writing-secure-php/writing-secure-php-4/

Writing Secure PHP series, covering cross-site scripting, cross-site request forgery and character encoding security issues.

In Writing Secure PHP, Writing Secure PHP, Part 2 and Writing Secure PHP, Part 3 I covered many of the common mistakes PHP developers make, and how to avoid some potential security problems. This article covers some of the more advanced security problems common to PHP on the web.

Cross-Site Scripting (XSS)

Cross-site scripting (often abbreviated to XSS) is a form of injection, where an attacker finds a way to have the target site display code they control. In its most basic form, this can be as simple as a site that allows HTML characters in usernames, where someone can specify a username like:

DaveChild<script type="text/javascript" src="http://www.example.com/my_script.js"></script>

Now, whenever someone sees my username on the target site, the script I've added to my username will run. I could potentially use this to grab the person's login information, log their keystrokes - any number of nefarious activities.

As a developer, you can combat this type of attack by encoding or removing HTML characters (watch out for character encoding issues, as outlined next). Even better than stripping out unwanted characters is to allow a whitelist of safe characters in usernames and other fields. Be especially careful with e-commerce sites where you are listing orders in a CMS - an XSS vulnerability may allow an attacker to gain administrative access to your CMS. It is also important to turn off TRACE and TRACK support on the server, as if there is a vulnerability (and always assume that despite your best efforts there will be) these potentially allow an attacker to steal a user's cookie.
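As a rough illustration of encoding HTML characters on output, the sketch below wraps htmlspecialchars() with an explicit character set. escape_html() is a hypothetical helper name, and UTF-8 is assumed to be the encoding used throughout the site.

```php
<?php
// escape_html() is a hypothetical helper; UTF-8 is an assumption about the site.
function escape_html(string $text): string
{
    // ENT_QUOTES also encodes single quotes, which matters inside attributes.
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

$username = 'DaveChild<script type="text/javascript" src="http://www.example.com/my_script.js"></script>';
$safe = escape_html($username); // angle brackets and quotes become entities
```

Encoding at the point of output, rather than once on input, means the same stored value can be escaped differently for HTML, JavaScript or a URL as needed.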

As a user you are also vulnerable to this sort of attack, and it is very difficult, at the moment, to make yourself safe against it. Vigilance is key, and to that end I have released a userscript that warns you about third party scripts (for users of GreaseMonkey, Opera or Chrome).

Cross-Site Request Forgery (CSRF)

Despite the similar name, CSRF is unconnected to XSS. CSRF is a form of attack where an authenticated user performs an action on a site without knowing it.

Let's assume that Jack is logged in to his bank, and has a cookie stored on his computer. Each time he sends an HTTP request to the bank (i.e., views a page or an image on a page) his browser sends the cookie along with the request so that the bank knows that it's him making the request.

Jill, meanwhile, runs a different website and has managed to get Jack to visit it. One of the items on the page is in fact loaded from the bank, for example in an iframe. The URL of the iframe or request contains instructions to the bank to transfer money from Jack's account to Jill's. Because the request is coming from Jack's computer, and includes his cookie, the bank assumes it is a legitimate request and the money is transferred.

This type of attack is extremely dangerous and virtually untraceable. As a developer, your job is to protect against it, and the best way to do that is to remember Rule Number One: Never, Ever Trust Your Users. No matter how authenticated they are, do not assume every request was intended.

In practical PHP terms, you can combat CSRF with several relatively simple coding habits. Never let a GET request change anything - always use POST for actions. Confirm actions before performing them with a confirmation dialog on a separate page - and make sure both the original action button or link and the confirmation were clicked. Even better, have the user enter information like letters from their password on the confirmation page.

Add a randomly generated token to forms and verify it when a request is made. Use frame-breaking JavaScript. Time-out sessions with a short timespan (think minutes, not hours). Encourage the user to log out when they've finished. Check the HTTP_REFERER header (it can be hidden or missing, but it is still worth checking: if it names a different domain to the one expected, treat the request as forged).
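The token habit can be sketched roughly as follows. csrf_token() and csrf_check() are hypothetical helpers, and the $session and $post arrays stand in for $_SESSION and $_POST so the example runs on its own. random_bytes() requires PHP 7 or later.

```php
<?php
// Hypothetical helpers; $session and $post stand in for $_SESSION and $_POST.
function csrf_token(array &$session): string
{
    // Generate one token per session and reuse it for every form.
    if (empty($session['csrf_token'])) {
        $session['csrf_token'] = bin2hex(random_bytes(32));
    }
    return $session['csrf_token'];
}

function csrf_check(array $session, array $post): bool
{
    $expected = $session['csrf_token'] ?? '';
    $supplied = $post['csrf_token'] ?? '';
    // Reject missing tokens, then compare in constant time.
    return is_string($supplied) && $expected !== '' && hash_equals($expected, $supplied);
}

$session = [];
$token = csrf_token($session); // embed this in a hidden form field
```

The token would be written into each form as a hidden field named csrf_token, and csrf_check() called before any POST action is carried out. A page on another site cannot read the token, so it cannot forge a matching request.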

Character Encoding

Character encoding in PHP and associated database systems is worthy of its own series. In any one request, there may be more different character encodings in use than you might think.

For example, a single request and response (uploading a file to a server and writing information to a database) may involve all of the following items, each with its own character encoding: the HTTP request headers, post data, PHP's default encoding, the PHP MySQL module, MySQL's default set, the set of each table being used, a file being opened and read, a new file being created and written, the response headers and the response body.

English-speaking developers generally don't have much cause to get embroiled in character encoding issues, and that results in a lot of developers with a serious lack of understanding of how character encodings work and fit together. For those that do have a reason to look at character encodings, usually that interest ends with the setting of the response's character set.

However, character sets are a fundamental part of all web development. English alone can exist in any one of a wide variety of sets, and developers are usually familiar with the most common two: ISO-8859-1 and UTF-8. Fewer are familiar with UCS-2, UTF-16 or windows-1252. Still fewer are familiar with commonly used alternative language sets (e.g., GB2312 for Chinese).

Which, in a very roundabout way, brings me on to the security pitfalls of character encodings. Where data is processed by PHP using one character set, but a database server uses a different character set, a character (or series of characters) deemed safe by PHP may in fact allow SQL injection against the database.

PHP security expert Chris Shiflett has written about this issue and included an example of how it can be exploited to allow SQL injection even where input is sanitized using addslashes().

The solution is to always use mysql_real_escape_string() rather than addslashes() (or use prepared statements / stored procedures), and to explicitly state character sets at all stages of interaction. Ideally, use the same character set throughout your system (UTF-8 is recommended) and where PHP allows you to specify a character encoding for a function (e.g., htmlspecialchars() or htmlentities()), make use of it.
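As a sketch of the prepared-statement route with an explicitly stated character set, the example below uses PDO (the original mysql_* functions were removed in PHP 7). The host, database, table and column names are placeholders, and connect() and find_user() are hypothetical helpers.

```php
<?php
// Hypothetical connection details; table and column names are placeholders.
function connect(string $user, string $pass): PDO
{
    // "charset" in the DSN states the connection encoding explicitly.
    return new PDO('mysql:host=localhost;dbname=example;charset=utf8mb4', $user, $pass, [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
        PDO::ATTR_EMULATE_PREPARES => false, // let the server bind parameters
    ]);
}

function find_user(PDO $pdo, string $username): ?array
{
    // The value is sent separately from the SQL, so it cannot alter the query.
    $stmt = $pdo->prepare('SELECT id, username FROM users WHERE username = :username');
    $stmt->execute([':username' => $username]);
    $row = $stmt->fetch(PDO::FETCH_ASSOC);
    return $row === false ? null : $row;
}
```

Because the SQL text and the data travel separately, there is no escaping step that a character set mismatch could undermine.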

It's not just SQL that's vulnerable as a result of character encoding bugs. Cross-site scripting is possible even where HTML characters are escaped if character sets are not handled properly. Fortunately, once again that is simple to avoid by properly setting character encodings at all stages of the process and specifying character encoding for functions where possible.
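One way to act on this, assuming the site uses UTF-8 throughout, is to reject malformed input early and to name the encoding whenever output is escaped. Both helper names below are hypothetical.

```php
<?php
// Both helpers are hypothetical and assume UTF-8 is used end to end.
function is_valid_utf8(string $input): bool
{
    // With the /u modifier, preg_match() fails on malformed UTF-8 instead of matching.
    return preg_match('//u', $input) === 1;
}

function escape_utf8(string $input): string
{
    // ENT_SUBSTITUTE replaces invalid sequences with U+FFFD rather than returning ''.
    return htmlspecialchars($input, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}
```

Checking validity on the way in and stating the encoding on the way out closes the gap where PHP and the browser disagree about what a sequence of bytes means.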



Thu, 11 Sep 2008 13:11:14 +0100 http://www.addedbytes.com/articles/writing-secure-php/writing-secure-php-4/ Dave Child

What Makes a Great Developer?
http://www.addedbytes.com/blog/what-makes-a-great-developer/

What makes a truly great developer? Some might say a positive attitude. Some might say a high-sugar, high-caffeine, high-bacon diet. Some might say an absence of sunlight and as many monitors as a desk can support.

Certainly, everyone has anecdotes about developers they've worked with who they thought were brilliant. Unfortunately, most of the time that judgement is made not based on code quality, or hitting of deadlines, but on less relevant criteria, like whether or not the developer knew the names of their colleagues, how many lines of code they output or how confident they sounded when talking about their work.

Unfortunately, the best developers don't always come across positively. While this list may not be applicable to every development environment, here are a few of the traits to look out for to spot a great developer.

Pessimistic

Great developers are almost always pessimistic with regard to their work. That doesn't mean they're not upbeat, lively or even cheerful - just that they will always be thinking about what can go wrong and how it can be dealt with.

They'll assume that at some point they'll need to undo work already completed, that hardware will fail, that all security will be compromised, and that your office will burn to the ground. The really brilliant ones will assume that will all happen on the same day. And they won't be happy until there is a specific, actionable, testable - and fully tested - plan for dealing with these sorts of issues. Even then they won't be completely happy.

Pessimistic developers will be the ones that find constant flaws in ideas, and the important thing to remember when working with them is that they're not doing that to tear down other people's ideas - they're doing it to ensure that the ideas that turn into projects are properly thought through and that as many problems as possible have been anticipated in advance. That neurotic, paranoid, pessimistic attitude is exactly what you should be looking for if what you want from your developers is robust, secure, reliable code.

By contrast, an optimistic developer will be more likely to simply assume code will work, or that it is secure, or give a deadline for a project without considering all the potential pitfalls.

Likely to be heard saying: "And what happens when that goes wrong?"

Lazy

Laziness is not usually viewed as a desirable trait, and in this case I don't mean turns-up-late-and-pretends-to-work laziness or just-move-that-logic-to-the-view laziness - both entirely unwanted. I mean a desire to not do tasks that are repetitive, or to waste time doing things a machine can do for you, or even to avoid future work by writing better code now. A lazy developer is one that builds a reusable code library, or wants a fully automated build process rather than a manual copy-and-paste one, or wants comprehensive automated unit testing, or writes code to be scalable even though that wasn't a requirement (rather than revisit it later).

As a bonus, a lazy developer is also usually one who will try and keep a project focussed on its core goals, rather than try and cram more work into the same time, providing a buffer against feature creep.

For example, when writing a category structure, a lazy developer might be likely to assume a many-to-many relationship between parent and child categories, even though the project specification says it will be a one-to-many relationship. Why? Because it might be needed one day and it would be better to write it that way from the start than to revisit it later.

Likely to be heard saying: "We could automate that."

Curious

Good developers are often rather like Gregory House. They're very easily bored by repetitive work (see laziness) and spend most of their time ploughing through it looking for an interesting and challenging (and hopefully new) problem to solve. The less time they can spend on the repetitive, the higher the frequency of the challenges.

Curious developers will be constantly looking for new problems to solve, and better ways to solve previous problems. They'll be the ones encouraging new ways to work and constantly tweaking and trying to improve existing systems. They'll also be the ones most conscious of existing problems in the current working environment, and trying to correct those problems. Curious developers will usually have a wide breadth of knowledge, not just of their primary language(s), but of supportive, associated and alternative technologies.

Curious (or easily-bored) developers are often the least stuck in their ways - the most open to change. They may well need convincing of why a new way of working is better (and that's no bad thing) but as long as it's an improvement, and likely to release more time to spend on the interesting problems, they'll embrace it with a minimum of resistance.

Curiosity also breeds creativity, another highly desirable trait in any developer. A strong desire to work out what has caused a problem and how to solve it is highly likely to motivate someone to continue once obvious avenues are exhausted. It is that sort of mentality that fosters "outside the box" thinking and creative coding.

Possibly the most useful attribute of a curious developer is that desire to find and cure a problem rather than just paper over the crack.

Likely to be heard saying: "Maybe there's another way to do this."

Meticulous

Many great developers are sticklers for detail. They will demand consistency in their work and the work of their team (they're likely to care about common code standards and naming conventions, for example). They'll want unit testing and peer review of code. They'll want everyone in their team to comment on and document code. They are likely to be fussy about version control log messages.

They'll also be fussy about details in communication, and happy to ask what might seem like obvious questions, simply to be sure they have properly understood. This is especially true of things like bug reports. While they may not be terribly motivational communicators, they will usually be able to explain concepts clearly and effectively. That clarity is a tremendous advantage in any development environment, especially if teaching and learning are encouraged.

Likely to be heard saying: "I just have a couple of questions ..."




Thu, 17 Apr 2008 13:03:00 +0100 http://www.addedbytes.com/blog/what-makes-a-great-developer/ Dave Child

PHP Querystring Functions
http://www.addedbytes.com/blog/code/php-querystring-functions/

Adding and removing variables to and from URLs using PHP can be a relatively simple process admittedly, but I have a couple of functions I use often to make the process even less time-consuming.




Add Querystring Variable

A PHP function that will add the querystring variable $key with a value $value to $url. If $key is already specified within $url, it will replace it.

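function add_querystring_var($url, $key, $value) {
    // Remove any existing copy of $key first (a trailing '&' simplifies the pattern).
    $pattern = '/(.*)(\?|&)' . preg_quote($key, '/') . '=[^&]+?(&)(.*)/i';
    $url = preg_replace($pattern, '$1$2$4', $url . '&');
    $url = substr($url, 0, -1);
    // $value is appended as-is; urlencode() it first if it may contain reserved characters.
    if (strpos($url, '?') === false) {
        return ($url . '?' . $key . '=' . $value);
    } else {
        return ($url . '&' . $key . '=' . $value);
    }
}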

Remove Querystring Variable

A PHP function that will remove the variable $key and its value from the given $url.

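function remove_querystring_var($url, $key) {
    // Append '&' so the pattern can always match a trailing separator, then strip it again.
    $pattern = '/(.*)(\?|&)' . preg_quote($key, '/') . '=[^&]+?(&)(.*)/i';
    $url = preg_replace($pattern, '$1$2$4', $url . '&');
    $url = substr($url, 0, -1);
    return ($url);
}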
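
For comparison, here is a sketch of a different technique that avoids regular expressions by using PHP's own parse_str() and http_build_query(). set_query_param() is a hypothetical name, not one of the functions above. Passing null as the value removes the key.

```php
<?php
// Alternative sketch using parse_str()/http_build_query(); the name is hypothetical.
function set_query_param(string $url, string $key, ?string $value): string
{
    // Split off the fragment first, then the query string.
    $fragment = '';
    $hashPos = strpos($url, '#');
    if ($hashPos !== false) {
        $fragment = substr($url, $hashPos);
        $url = substr($url, 0, $hashPos);
    }
    $query = '';
    $queryPos = strpos($url, '?');
    if ($queryPos !== false) {
        $query = substr($url, $queryPos + 1);
        $url = substr($url, 0, $queryPos);
    }

    // Note: parse_str() converts dots and spaces in variable names to underscores.
    parse_str($query, $params);
    if ($value === null) {
        unset($params[$key]);
    } else {
        $params[$key] = $value;
    }

    // http_build_query() URL-encodes keys and values.
    $query = http_build_query($params, '', '&');
    return $url . ($query === '' ? '' : '?' . $query) . $fragment;
}
```

The trade-off is that this version re-encodes the whole query string, so it will normalise how existing values are encoded, whereas the regular expression versions leave the rest of the URL untouched.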


Tue, 05 Dec 2006 15:41:30 +0000 http://www.addedbytes.com/blog/code/php-querystring-functions/ Dave Child

Writing Secure PHP, Part 3
http://www.addedbytes.com/articles/writing-secure-php/writing-secure-php-3/

In Writing Secure PHP and Writing Secure PHP, Part 2 I covered many of the basic mistakes PHP developers make, and how to avoid common security problems. It is time to get a little deeper into security though, and begin to tackle some more advanced issues.

[Writing Secure PHP is a series. Part 1, Part 2 and Part 4 are currently also available.]

Context

Before I start, it is worth mentioning at this point in this series that much of what is to come is highly dependent on context. If you are running a small personal site and are regularly backing it up, the chances are that there is no real benefit to you spending weeks on advanced security issues. If an attacker can gain nothing (and cause no harm) by compromising your site, and it would only take you ten minutes to restore it, should something go wrong, then it would be a waste to spend too long on security concerns. At the other end of the scale, if you are managing an ecommerce site that processes thousands of credit cards a day, then it is negligent not to spend a lot of time researching and improving your site's security.

Database Field Lengths

Database (we're going to talk about MySQL here, but this is applicable to any database) fields are always of a specific type, and every type has its limits. In MySQL, you can also limit field lengths further than their types already limit them.

However, to the inexperienced developer, this can present problems. If you are allowing users to post an article on your site, and adding that to a database field with type "blob", then the longest article you can store in the database is 65,535 bytes (65,535 characters of plain English text, fewer if multi-byte characters are used). For most articles that will be fine, but what is going to happen when a user posts an article of 100,000 characters? At best, if you have set up your site so errors are not displayed, their article will simply vanish without being added to the site.

Remember that for an attacker to be able to compromise your system, they need information about it. They need to find weaknesses. Error messages are a very powerful part of that and if you are displaying errors, then an attacker can make use of this to find out information about your database.

To fix this, simply check the lengths of data input through forms and querystrings and ensure that before you launch a site you check forms will not cause errors to be displayed when too many characters are entered.
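A length check of this kind can be very small. The sketch below assumes a MySQL BLOB or TEXT column, which holds at most 65,535 bytes. fits_column() and the constant name are hypothetical.

```php
<?php
// 65535 is the byte limit of a MySQL BLOB/TEXT column; the names here are hypothetical.
const MAX_ARTICLE_BYTES = 65535;

function fits_column(string $value, int $maxBytes = MAX_ARTICLE_BYTES): bool
{
    // strlen() counts bytes, which is what the column limit is measured in.
    return strlen($value) <= $maxBytes;
}

$article = str_repeat('a', 100000);
$ok = fits_column($article); // false: show the user a friendly message instead of querying
```

Running the check before the query means an oversized submission produces a message you control, rather than a database error or silently lost data.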

Weak Passwords

Dictionaries are a useful tool for an attacker. If you have a site with a login system and your database were compromised (and there is no harm in assuming that at some point it will be), an attacker can grab a list of hashed passwords. It is difficult (practically impossible) to directly translate a hash back into a password.

However, most attackers will have databases containing lists of words and their matching hashes in common formats (eg a database with all words in English and their MD5 hashes). It is fairly easy, should someone gain access to your database, for them to compare a hashed password to this list of pre-hashed passwords. If a match is found in the list, the attacker then knows what the un-hashed password is.

There are ways to avoid this problem, and the best of those is to ensure that only strong passwords are ever used. Some people find gauging the strength of passwords tricky, but the general rule of thumb is: a password like "password", "admin", "god", "sex", "qwerty", "123456" or similar (i.e. easily guessable) is extremely weak; a password made up only of a word in the dictionary is weak; a password made of letters, numbers and making use of upper and lower case is strong (there is a strong usability case to be made for not using case-sensitive passwords - if you wish to use case-insensitive ones, simply perform checks to ensure people do not pick passwords like "password12345").
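The rule of thumb above could be turned into a rough check like the one below. password_strength() is a hypothetical helper, the dictionary is passed in by the caller, and treating everything that is not "strong" as "weak" is my simplification.

```php
<?php
// password_strength() is hypothetical; $dictionary would be a real word list in practice.
function password_strength(string $password, array $dictionary): string
{
    $lower = strtolower($password);
    $common = ['password', 'admin', 'god', 'sex', 'qwerty', '123456'];
    if (in_array($lower, $common, true)) {
        return 'extremely weak';
    }
    if (in_array($lower, $dictionary, true)) {
        return 'weak'; // a bare dictionary word
    }
    // Strong here means lower case, upper case and digits are all present.
    $mixed = preg_match('/[a-z]/', $password)
        && preg_match('/[A-Z]/', $password)
        && preg_match('/[0-9]/', $password);
    return $mixed ? 'strong' : 'weak';
}
```

A signup form could refuse anything that does not come back as "strong", and explain to the user what is missing.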

Clients

Clients are a huge security risk, believe it or not. Some will hire a cheaper developer to make small changes six months after you're finished. Some will give out FTP details to anyone who phones and asks for them. [Out of curiosity, I decided to see how easy it is to get FTP details over the phone. I visited the site of a local company (who shall remain nameless) and found the name of their design company (who shall also remain nameless). I then phoned the local company and told them I was with the design company and needed them to send me the site's FTP details. They agreed without question or hesitation. Scary. (I told them what I was doing before they sent any sensitive data to me and they are now better educated and suitably paranoid about people asking for details over the phone).]

Some will ignore emails from people pointing out security problems (in the process of writing the previous article in this series, I found a large selection of sites with publicly available database connection scripts. I emailed the owners explaining why they are at risk, and only one has replied and had the problem fixed at the time of writing). Admittedly, many of the emails and calls they receive will be misinformation or sales pitches, but it is still worth them having someone check this out - they do not know enough to distinguish a genuine problem from the rest.

Unfortunately, this is one security problem that cannot be solved with code. This one requires education. For this reason, I have created an unbranded copy of the sheet I give to my clients, with a selection of security tips on. When we launch the site, I sit down with them and tell them how they need to treat their site, and what to consider when making decisions regarding it.

Client Security Handout (PNG, 74KB)

Code Injection (a.k.a. "Cross-Site Scripting")

Unlike SQL Injection, which relies on the use of delimiters in user-input text to take control of database queries, code injection relies on mistakes in the treatment of text before it is output. Or, to put it in simpler terms, code injection is where a malicious user uses a text box to add HTML that they've written to your webpage.

Let's say you have a system that allows users to register as members to your site and that they are allowed to create their own username. They fill out a form, and you insert the data they enter, once you've made it safe to use in a SQL query, into a database. Your members listing page fetches all the usernames from the database and lists them, outputting exactly what is in the database to anyone that views that page.

Now, let's say you've not added a limit to username lengths. Someone could, if they wanted, create a user with the following username:

Username<script type="text/javascript" src="http://www.website.com/malicious.js"></script>

Anyone that then views a page with that username on it will see a normal username, but a JavaScript file will have been loaded from another site, invisibly to the user.

There are plenty of uses for this. First and foremost, it allows attackers to add keyloggers, tracking scripts or porn banners on your site, or just stop your site working altogether. There are several ways to ensure this doesn't happen. First, you could encode HTML in usernames, if you want to allow people to use greater-than and less-than signs in their usernames. If not, you can strip these characters out, or strip out HTML tags altogether.

Another, better way to approach this is to limit the character set that can be used in usernames. If you only allow letters and numbers, for example, you could simply use a regular expression in the signup process to validate the username and force the user to pick another if they have disallowed characters in their username. Obviously the problem is not just applicable to usernames - however, as with most other security concerns, being quite paranoid will ensure that you always check data coming from a user before outputting it, and sanitise it in an appropriate way.
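A letters-and-numbers check of this sort might be sketched as follows. is_valid_username() is a hypothetical helper, and the 3 to 20 character range is an arbitrary assumption.

```php
<?php
// is_valid_username() is hypothetical; the 3-20 length range is an arbitrary assumption.
function is_valid_username(string $username): bool
{
    // \A and \z anchor the whole string (unlike $, \z does not allow a trailing newline).
    return preg_match('/\A[A-Za-z0-9]{3,20}\z/', $username) === 1;
}
```

Because the pattern lists what is allowed rather than what is forbidden, anything unexpected is rejected by default.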

Aftermath

Part of a good security strategy is the assumption that at some point everything (and I mean everything) will be deleted or destroyed. It is wise to assume that at some point any security measure you have in place will be compromised. All data may be taken (which is one reason why it is important to encrypt things like passwords and credit card numbers in databases), all files deleted and so on.

One part of PHP development, though perhaps not directly about PHP security, is ensuring that after a catastrophic failure a site can be brought back online quickly. While downtime of four hours may be acceptable with a low-traffic point-of-presence site, any ecommerce retailer is going to erupt with fury at the thought of that much lost revenue.

Dealing with the client under these circumstances is the first step. Often, your first inkling of a problem with a site may actually come from the client. They may have phoned you and could be angry, worried, or a myriad of other emotions. At moments like this, you would be very glad to have a clear contingency plan in place. Many developers panic when the client phones saying their front page has been defaced. Stick to your action plan and to your client you will seem confident and unfazed. That will relax them. The plan will also allow you to resolve the problem far faster.

First, find out what happened. Are you dealing with a security breach or has someone at the host company tripped over a power lead? Was the database compromised, or deleted, as a result of an attack or was your server simply unable to cope with too much traffic? You need to know what has happened in order to deal with it - a site going offline could be down to too many factors to just assume it is a security problem.

Assuming this is a security problem, the next step is to reassure the client. Let them know what has happened. If someone got into the database, no problem - all sensitive data is encrypted. If they've uploaded files to your server (quite possible), you'll have to delete all files and restore from a backup.

You've got to find out how the attacker broke into your system. Check log files, if you have access to them. Also, have a look at hacker and cracker web sites - many of them will list successful attacks against servers by various groups (these are often what are known as "script kiddies" - not hackers as such, but usually exploiting vulnerabilities found by others). You may well find your site listed and that listing will give you invaluable information. Look at other sites brought down by the same group at around the same time - you will often spot a theme (e.g. all sites that have been attacked were running the same version of IIS or Apache, were all running phpBB, or all are file repositories running on CFML).

If you are running any third party software on the site, check the distribution site and if necessary get in touch with them, especially if other sites running the same software appear to have been compromised.

It is very important that you fix any hole there may be before you restore the site. It would be wise to add a "We are currently undergoing essential maintenance" page, but do not fully restore the site before you have found out and fixed whatever the problem was - you'll be wasting your time.

Shared Hosting

Shared hosting is much cheaper than dedicated hosting, and is where several sites are all hosted on the same server. Most sites are hosted this way, and this brings with it its own set of security issues.

First and foremost, the security of your site is, in these circumstances, almost entirely out of your hands. It is dependent on the hosting company you are with. They may be excellent, or they may be crooks. Check reviews of a company before you select them, as they will have access to all the data you store with them. There is no harm in being automatically suspicious of your hosting company.

If they are completely above board (and most are), you are still not necessarily secure with shared hosting. The security measures they put in place are generally pretty simple. Shared hosting servers should always use PHP's safe mode (which disables many of the more advanced and dangerous features of PHP). That is what it is there for. However, many don't.

Vulnerabilities associated with shared hosting are, for the most part, out of your hands. A badly set up server will allow any site on that server to access files like /etc/passwd and httpd.conf, often giving them access to all other sites on the same server. It is possible to secure yourself to some degree against the effects of this vulnerability. Storing information in a database is recommended. Of course, if you then store your database login in a file, an attacker could access this information. In order to make this inaccessible to others on the same server, you could set database login information within the httpd.conf file, using environment variables (you will need to ask your hosting company to add the lines to the httpd.conf file).
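As a rough sketch of the environment variable approach, the script below reads the database login from the environment instead of from a file. The variable names DB_USER and DB_PASS are my own assumptions, and it presumes your host has added matching lines (something like "SetEnv DB_USER example_user") to httpd.conf for your site.

```php
<?php
// Hypothetical variable names; the host would need to define them in
// httpd.conf for this site (for example with mod_env's SetEnv directive).
function db_credentials_from_env(): array
{
    $user = getenv('DB_USER');
    $pass = getenv('DB_PASS');

    // getenv() returns false when a variable is not set.
    if ($user === false || $pass === false) {
        throw new RuntimeException('Database credentials are not configured.');
    }

    return ['user' => $user, 'pass' => $pass];
}
```

The point of the design is that no file inside your own account contains the login, so another site on the same server has nothing of yours to read.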

Better yet is to ensure that your host, if shared, uses safe mode. While this is still not 100% secure (nothing is), it does help make these attacks more difficult. A dedicated server is another, far better, option, but the expense may be prohibitive.

Ready for more? Try Writing Secure PHP, Part 4.



Wed, 27 Jul 2005 09:58:00 +0100 http://www.addedbytes.com/articles/writing-secure-php/writing-secure-php-3/ Dave Child
Writing Secure PHP, Part 2
http://www.addedbytes.com/articles/writing-secure-php/writing-secure-php-2/

In Writing Secure PHP, I covered a few of the most common security holes in websites. It's time to move on, though, to a few more advanced techniques for securing a website. As techniques for 'breaking into' a site or crashing a site become more advanced, so must the methods used to stop those attacks.

[Writing Secure PHP is a series. Part 1, Part 3 and Part 4 are currently also available.]

File Systems

Most hosting environments are very similar, and rather predictable. Many web developers are also very predictable. It doesn't take a genius to guess that a site's includes directory (and most dynamic sites use one for common files) is at www.website.com/includes/. If the site owner has allowed directory listing on the server, anyone can navigate to that folder and browse files.

Imagine for a second that you have a database connection script, and you want to connect to the database from every page on your site. You might well place that in your includes folder, and call it something like connect.inc. However, this is very predictable - many people do exactly this. Worst of all, a file with the extension ".inc" is usually rendered as text and output to the browser, rather than processed as a PHP script - meaning if someone were to visit that file in a browser, they'll be given your database login information.

Placing important files in predictable places with predictable names is a recipe for disaster. Placing them outside the web root can help to lessen the risk, but is not a foolproof solution. The best way to protect your important files from vulnerabilities is to place them outside the web root, in an unusually-named folder, and to make sure that error reporting is set to off (which should make life difficult for anyone hoping to find out where your important files are kept). You should also make sure directory listing is not allowed, and that all folders have a file named "index.html" in (at least), so that nobody can ever see the contents of a folder.

Never, ever, give a file the extension ".inc". If you must have ".inc" in the extension, use the extension ".inc.php", as that will ensure the file is processed by the PHP engine (meaning that anything like a username and password is not sent to the user). Always make sure your includes folder is outside your web root, and not named something obvious. Always make sure you add a blank file named "index.html" to all folders like include or image folders - even if you deny directory listing yourself, you may one day change hosts, or someone else may alter your server configuration - if directory listing is allowed, then your index.html file will make sure the user always receives a blank page rather than the directory listing. As well, always make sure directory listing is denied on your web server (easily done with .htaccess or httpd.conf).
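To illustrate keeping includes outside the web root, here is a small sketch that builds the path to a private file. The folder name "private_k3x9" and the directory layout are made up for the example; yours will differ.

```php
<?php
// Hypothetical layout: /home/example/public_html is the web root and
// /home/example/private_k3x9 holds files that must never be served directly.
function private_file_path(string $webRoot, string $privateDir, string $file): string
{
    // basename() drops any directory part, so "../" cannot escape the folder.
    return dirname(rtrim($webRoot, '/')) . '/' . $privateDir . '/' . basename($file);
}

// Typical use at the top of a page:
// require private_file_path($_SERVER['DOCUMENT_ROOT'], 'private_k3x9', 'connect.inc.php');
```

Because the file lives one level above the web root, there is no URL that maps to it, whatever its extension.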

------

Out of sheer curiosity, shortly after writing this section of this tutorial, I decided to see how many sites I could find in a few minutes vulnerable to this type of attack. Using Google and a few obvious search phrases, I found about 30 database connection scripts, complete with usernames and passwords. A little more hunting turned up plenty more open include directories, with plenty more database connections and even FTP details. All in, it took about ten minutes to find enough information to cause serious damage to around 50 sites, without even using these vulnerabilities to see if it were possible to cause problems for other sites sharing the same server.

-----

Login Systems

Most site owners now require an online administration area or CMS (content management system), so that they can make changes to their site without needing to know how to use an FTP client. Often, these are placed in predictable locations (as covered in the last article). However, placing an administration area in a hard-to-find location isn't enough to protect it.

Most CMSes allow users to change their password to anything they choose. Many users will pick an easy-to-remember word, often the name of a loved one or something similar with special significance to them. Attackers will use something called a "dictionary attack" (a simple form of "brute force attack") to break this kind of protection. A dictionary attack involves entering each word from the dictionary in turn as the password until the correct one is found.

The best way to protect against this is threefold. First, you should add a Turing test to the login page. Have a randomly generated series of letters and numbers on the page that the user must enter to log in. Make sure this series changes each time the user tries to log in, that it is an image (rather than simple text), and that it cannot be identified by an optical character recognition script.

Second, add in a simple counter. If you detect a certain number of failed logins in a row, disable logging in to the administration area until it is reactivated by someone responsible. If you only allow each potential attacker a small number of attempts to guess a password, they will have to be very lucky indeed to gain access to the protected area. This might be inconvenient for authentic users, but it is usually a price worth paying.

Finally, make sure you track the IP addresses of both those users who successfully log in and those who don't. If you spot repeated attempts from a single IP address to access the site, you may consider blocking access from that IP address altogether.
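The counter and the IP tracking can be combined. The following is only a sketch: it keeps the counts in a PHP array for clarity, where a real site would use a database table, and the limit of five attempts is an arbitrary choice of mine.

```php
<?php
// Hypothetical limit; pick a value that suits your users.
const MAX_FAILED_LOGINS = 5;

// $failures maps an IP address to its count of consecutive failed logins.
// In a real site this would live in a database table, not a PHP array.
function record_login_attempt(array $failures, string $ip, bool $success): array
{
    if ($success) {
        unset($failures[$ip]);          // a good login resets the counter
    } else {
        $failures[$ip] = ($failures[$ip] ?? 0) + 1;
    }
    return $failures;
}

function is_locked_out(array $failures, string $ip): bool
{
    return ($failures[$ip] ?? 0) >= MAX_FAILED_LOGINS;
}
```

Call is_locked_out() before checking the password at all, so that a locked-out address cannot keep guessing.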

Database Users

Even if someone gains access to your database who shouldn't be able to, you can limit the damage they can cause. Modern databases like MySQL and SQL Server allow you to control what a user can and cannot do. Using these permissions, you can grant or deny users the ability to create, edit and delete data, and more. Usually, I try to ensure that I only allow users to add and edit data.

If a site requires an item be deleted, I will usually set the front end of the site to only appear to delete the item. For example, you could have a numeric field called "item_deleted", and set it to 1 when an item is deleted. You can then use that to prevent users seeing these items. You can then purge these later if required, yourself, while not giving your users "delete" permissions for the database. If a user cannot delete or drop tables, neither can someone who finds out the user login to the database (though obviously they can still do damage).
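Here is one way the "item_deleted" flag might look in code. I'm assuming a PDO connection and a table called "items" with "id" and "name" columns; only the "item_deleted" column comes from the description above.

```php
<?php
// Table and column names other than "item_deleted" are hypothetical.
function soft_delete_item(PDO $db, int $id): void
{
    // An UPDATE is enough, so the site's database user never needs DELETE rights.
    $stmt = $db->prepare('UPDATE items SET item_deleted = 1 WHERE id = ?');
    $stmt->execute([$id]);
}

function visible_items(PDO $db): array
{
    // Flagged rows stay in the table but are never shown to visitors.
    $stmt = $db->query('SELECT id, name FROM items WHERE item_deleted = 0 ORDER BY id');
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}
```

Purging the flagged rows for real can then be done by hand, using a separate database login that does have delete permissions.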

Powerful Commands

PHP contains a variety of commands with access to the operating system of the server, and that can interact with other programs. Unless you need access to these specific commands, it is highly recommended that you disable them entirely.

For example, the eval() function allows you to treat a string as PHP code and execute it. This can be a useful tool on occasion. However, if you use the eval() function on any input from the user, the user could cause all sorts of problems. You could be, without careful input validation, giving the user free rein to execute whatever commands he or she wants.
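In many cases where eval() looks tempting, a fixed list of permitted operations does the job instead. This is a sketch of that idea rather than a drop-in replacement; the operation names are invented for the example.

```php
<?php
// Hypothetical operation names; only these can ever be run.
function run_allowed_operation(string $name, float $a, float $b): float
{
    $operations = [
        'add'      => fn(float $x, float $y): float => $x + $y,
        'subtract' => fn(float $x, float $y): float => $x - $y,
        'multiply' => fn(float $x, float $y): float => $x * $y,
    ];

    // User input only selects from the list; it is never executed as code.
    if (!array_key_exists($name, $operations)) {
        throw new InvalidArgumentException('Unknown operation.');
    }

    return $operations[$name]($a, $b);
}
```

Whatever the user sends, the worst they can do is ask for an operation that doesn't exist.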

There are ways to get around this. Not using eval() is a good start. However, the php.ini file gives you a way to completely disable certain functions in PHP - "disable_functions". This directive of the php.ini file takes a comma-separated list of function names, and will completely disable these in PHP. Commonly disabled functions include ini_set(), exec(), fopen(), popen(), passthru(), readfile(), file(), shell_exec() and system().
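If you want to confirm from a script what your host has actually disabled, something like the sketch below should do it. Note that eval() is a language construct rather than a function, so as far as I know it cannot be switched off with "disable_functions"; avoiding it remains the safest course. The function names here are my own.

```php
<?php
// Returns the function names listed in php.ini's disable_functions directive.
function disabled_functions(): array
{
    $raw = (string) ini_get('disable_functions');
    // Split on commas, trim whitespace, and drop empty entries.
    return array_values(array_filter(array_map('trim', explode(',', $raw)), 'strlen'));
}

function is_function_disabled(string $name): bool
{
    return in_array(strtolower($name), array_map('strtolower', disabled_functions()), true);
}
```

You could call is_function_disabled('exec') on a test page to check that the directive has taken effect after changing php.ini.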

It may be (it usually is) worth enabling safe_mode on your server. This instructs PHP to limit the use of functions and operators that can be used to cause problems. If it is possible to enable safe_mode and still have your scripts function, it is usually best to do so.

Finally, Be Completely and Utterly Paranoid

Much as I hate to bring this point up again, it still holds true (and always will). Most of the above problems can be avoided through careful input validation. Some become obvious points to address when you assume everyone is out to destroy your site. If you are prepared for the worst, you should be able to deal with anything.

Ready for more? Try Writing Secure PHP, Part 3.



Tue, 22 Mar 2005 16:53:00 +0000 http://www.addedbytes.com/articles/writing-secure-php/writing-secure-php-2/ Dave Child
Password Protect a Directory with .htaccess
http://www.addedbytes.com/blog/code/password-protect-a-directory-with-htaccess/

Password protecting a directory can be done several ways. Many people use PHP or ASP to verify users, but if you want to protect a directory of files or images (for example), that often isn't practical. Fortunately, Apache has a built-in method for protecting directories from prying eyes, using the .htaccess file.

In order to protect your chosen directory, you will first need to create an .htaccess file. This is the file that the server will check before allowing access to anything in the same directory. That's right, the .htaccess file belongs in the directory you are protecting, and you can have one in each of as many directories as you like.

You'll need first to define a few parameters for the .htaccess file. It needs to know where to find certain information, for example a list of valid usernames and passwords. This is a sample of the few lines required in an .htaccess file to begin with, telling it where the usernames and passwords can be found, amongst other things.

AuthUserFile /full/path/to/.htpasswd
AuthName "Please Log In"
AuthType Basic

You've now defined a few basic parameters for Apache to manage the authorisation process. First, you've defined the location of the .htpasswd file. This is the file that contains all the usernames and encrypted passwords for your site. We'll cover adding information to this file shortly. It's extremely important that you place this file outside of the web root. You should only be able to access it by FTP, not over the web.

The AuthName parameter basically just defines the title of the password entry box when the user logs in. It's not exactly the most important part of the file, but should be defined. The AuthType tells the server what sort of processing is in use, and "Basic" is the most common and perfectly adequate for almost any purpose.

We've told Apache where to find files, but we've not told it who, of those people defined in the .htpasswd file, can access the directory. For that reason, we still have another line to define.

If we want to grant access to everyone in the .htpasswd file, we can add this line ("valid-user" is like a keyword, telling Apache any user will do):

require valid-user

If we want to just grant access to a single user, we can use "user" and their username instead of "valid-user":

require user dave

A normal and complete .htaccess file might look like this:

AuthUserFile /home/dave/.htpasswd
AuthName "Dave's Login Area"
AuthType Basic
require user dave

Now we have almost everything defined, but we are still missing an .htpasswd file. Without that, the server won't know what usernames and passwords are ok.

An .htpasswd file is made up of a series of lines, one for each valid user. Each line looks like this, with a username, then colon, then encrypted password:

username:encryptedpassword

The password encryption is the same as you'll find in PHP's crypt() function. It is not reversible, so you can't find out a password from the encrypted version. (Please note that on page 2 of this article is a tool to help you generate an .htpasswd file, that will help you encrypt passwords).

A user of "dave" and password of "dave" might be added with the following line:

dave:XO5UAT7ceqPvc

Each time you run an encryption function like "crypt", you will almost certainly get a different result. This is down to something called "salt", which in the above case was "XO" (first two letters of encrypted password). Different salt will give different encrypted values, and if not explicitly specified will be randomly generated. Don't worry though, the server is quite capable of understanding all this - if you come up with a different value for the encrypted password and replace it, everything would still work fine, as long as the password was the same.
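To make the salt behaviour concrete, here is a minimal sketch that builds an .htpasswd line with PHP's crypt() and checks a password against it. The function names are mine, and it assumes the traditional two-character salt scheme described above.

```php
<?php
// $salt is two characters from ./0-9A-Za-z, which selects the traditional
// DES-based crypt() scheme described above.
function htpasswd_line(string $username, string $password, string $salt): string
{
    return $username . ':' . crypt($password, $salt);
}

// Checking works by re-running crypt() with the stored value as the salt:
// crypt() reads the salt back from the first two characters.
function htpasswd_matches(string $password, string $storedHash): bool
{
    return hash_equals($storedHash, crypt($password, $storedHash));
}
```

Calling htpasswd_line('dave', 'dave', 'XO') gives a line beginning "dave:XO", and a different salt gives a different encrypted value for the same password.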

Once you've created your .htpasswd file, you need to upload it to a safe location on your server, and check you've set the .htaccess file to point to it correctly. Then, upload the .htaccess file to the directory you want to protect and you'll be all set. Simply visit the directory to check it is all working.

.htpasswd Generator

The .htpasswd file needs encrypted passwords, which can be a problem for anyone without experience with a programming language. For that reason, I've created this simple tool, which, if you enter the username and password you wish to use, will generate the appropriate line to add to your .htpasswd file.



Tue, 15 Mar 2005 09:58:46 +0000 http://www.addedbytes.com/blog/code/password-protect-a-directory-with-htaccess/ Dave Child
Writing Secure PHP, Part 1
http://www.addedbytes.com/articles/writing-secure-php/writing-secure-php-1/

PHP is a very easy language to learn, and many people without any sort of background in programming learn it as a way to add interactivity to their web sites. Unfortunately, that often means PHP programmers, especially those newer to web development, are unaware of the potential security risks their web applications can contain. Here are a few of the more common security problems and how to avoid them.

[Writing Secure PHP is a series. Part 2, Part 3 and Part 4 are currently also available.]

Rule Number One: Never, Ever, Trust Your Users

It can never be said enough times: you should never, ever, ever trust your users to send you the data you expect. I have heard many people respond to that with something like "Oh, nobody malicious would be interested in my site". Leaving aside that that could not be more wrong, it is not always a malicious user who can exploit a security hole - problems can just as easily arise because of a user unintentionally doing something wrong.

So the cardinal rule of all web development, and I can't stress it enough, is: Never, Ever, Trust Your Users. Assume every single piece of data your site collects from a user contains malicious code. Always. That includes data you think you have checked with client-side validation, for example using JavaScript. If you can manage that, you'll be off to a good start. If PHP security is important to you, this single point is the most important to learn. Personally, I have a "PHP Security" sheet next to my desk with major points on, and this is in large bold text, right at the top.

Global Variables

In many languages you must explicitly create a variable in order to use it. In PHP, there is an option, "register_globals", that you can set in php.ini that allows you to use global variables, ones you do not need to explicitly create.

Consider the following code:

if ($password == "my_password") {
    $authorized = 1;
}

if ($authorized == 1) {
    echo "Lots of important stuff.";
}

To many that may look fine, and in fact this exact type of code is in use all over the web. However, if a server has "register_globals" set to on, then simply adding "?authorized=1" to the URL will give anyone free access to exactly what you do not want everyone to see. This is one of the most common PHP security problems.

Fortunately, this has a couple of possible simple solutions. The first, and perhaps the best, is to set "register_globals" to off. The second is to ensure that you only use variables that you have explicitly set yourself. In the above example, that would mean adding "$authorized = 0;" at the beginning of the script:

$authorized = 0;
if ($password == "my_password") {
    $authorized = 1;
}

if ($authorized == 1) {
    echo "Lots of important stuff.";
}

Error Messages

Errors are a very useful tool for both programmer and hacker. A developer needs them in order to fix bugs. A hacker can use them to find out all sorts of information about a site, from the directory structure of the server to database login information. If possible, it is best to turn off all error reporting in a live application. PHP can be told to do this through .htaccess or php.ini, by setting "error_reporting" to "0". If you have a development environment, you can set a different error reporting level for that.
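If you can't change php.ini or .htaccess, the same effect can be had at the top of each script. The sketch below is a variation on the advice above: rather than setting "error_reporting" to "0", it hides errors from visitors but still writes them to a log, so you can see what went wrong. The log path is whatever you pass in.

```php
<?php
// Call once at the start of every request on the live site.
function configure_live_error_handling(string $logFile): void
{
    error_reporting(E_ALL);            // still record every problem...
    ini_set('display_errors', '0');    // ...but never show it to visitors
    ini_set('log_errors', '1');
    ini_set('error_log', $logFile);    // ideally a path outside the web root
}
```

On a development machine you would skip this call, or set "display_errors" back on.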

SQL Injection

One of PHP's greatest strengths is the ease with which it can communicate with databases, most notably MySQL. Many people make extensive use of this, and a great many sites, including this one, rely on databases to function.

However, as you would expect, with that much power there are potentially huge security problems you can face. Fortunately, there are plenty of solutions. The most common security hazard faced when interacting with a database is that of SQL Injection - when a user uses a security glitch to run SQL queries on your database.

Let's use a common example. Many login systems feature a line that looks a lot like this when checking the username and password entered into a form by a user against a database of valid username and password combinations, for example to control access to an administration area:

$check = mysql_query("SELECT Username, Password, UserLevel FROM Users WHERE Username = '".$_POST['username']."' and Password = '".$_POST['password']."'");

Look familiar? It may well do. And on the face of it, the above does not look like it could do much damage. But let's say for a moment that I enter the following into the "username" input box in the form and submit it:

' OR 1=1 #

The query that is going to be executed will now look like this:

SELECT Username, Password, UserLevel FROM Users WHERE Username = '' OR 1=1 #' and Password = ''

The hash symbol (#) tells MySQL that everything following it is a comment and to ignore it. So it will actually only execute the SQL up to that point. As 1 always equals 1, the SQL will return all of the usernames and passwords from the database. And as the first username and password combination in most user login databases is the admin user, the person who simply entered a few symbols in a username box is now logged in as your website administrator, with the same powers they would have if they actually knew the username and password.

With a little creativity, the above can be exploited further, allowing a user to create their own login account, read credit card numbers or even wipe a database clean.

Fortunately, this type of vulnerability is easy enough to work around. By checking for apostrophes in the items we enter into the database, and removing or neutralising them, we can prevent anyone from running their own SQL code on our database. The function below would do the trick:

function make_safe($variable) {
    $variable = mysql_real_escape_string(trim($variable));
    return $variable;
}

Now, to modify our query. Instead of using _POST variables as in the query above, we now run all user data through the make_safe function, resulting in the following code:

$username = make_safe($_POST['username']);
$password = make_safe($_POST['password']);
$check = mysql_query("SELECT Username, Password, UserLevel FROM Users WHERE Username = '".$username."' and Password = '".$password."'");

Now, if a user entered the malicious data above, the query will look like the following, which is perfectly harmless. The following query will select from a database where the username is equal to "\' OR 1=1 #".

SELECT Username, Password, UserLevel FROM Users WHERE Username = '\' OR 1=1 #' and Password = ''

Now, unless you happen to have a user with a very unusual username and a blank password, your malicious attacker will not be able to do any damage at all. It is important to check all data passed to your database like this, however secure you think it is. HTTP Headers sent from the user can be faked. Their referral address can be faked. Their browser's User Agent string can be faked. Do not trust a single piece of data sent by the user, and you will be fine.
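The mysql_ functions used above were removed in PHP 7, so on a current server a different technique is needed. One common option is a prepared statement, sketched here with PDO. It assumes a PDO connection and the same Users table as the query above; the function name is my own.

```php
<?php
// Assumes $db is a PDO connection and the Users table from the query above.
function find_user(PDO $db, string $username, string $password): ?array
{
    // Placeholders keep user input out of the SQL text entirely,
    // so there is nothing to escape.
    $stmt = $db->prepare(
        'SELECT Username, UserLevel FROM Users WHERE Username = ? AND Password = ?'
    );
    $stmt->execute([$username, $password]);
    $row = $stmt->fetch(PDO::FETCH_ASSOC);
    return $row === false ? null : $row;
}
```

With this in place, entering the malicious username from earlier simply fails to match any row, because it is treated as data and never as SQL.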

File Manipulation

Some sites currently running on the web today have URLs that look like this:

index.php?page=contactus.html

The "index.php" file then simply includes the "contactus.html" file, and the site appears to work. However, the user can very easily change the "contactus.html" bit to anything they like. For example, if you are using Apache's mod_auth to protect files and have saved your password in a file named ".htpasswd" (the conventional name), then if a user were to visit the following address, the script would output your username and password:

index.php?page=.htpasswd

By changing the URL, on some systems, to reference a file on another server, they could even run PHP that they have written on your site. Scared? You should be. Fortunately, again, this is reasonably easy to protect against. First, make sure you have correctly set "open_basedir" in your php.ini file, and have set "allow_url_fopen" to "off". That will prevent most of these kinds of attacks by preventing the inclusion of remote files and system files. Next, if you can, check the file requested against a list of valid files. If you limit the files that can be accessed using this script, you will save yourself a lot of aggravation later.
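Checking the requested file against a list of valid files might look something like this. The page names are placeholders; the idea is that the value from the URL is only ever used to look up a filename, never used as one.

```php
<?php
// Hypothetical page list; the keys are the only values accepted from the URL.
function resolve_page(string $requested): string
{
    $pages = [
        'home'      => 'home.html',
        'contactus' => 'contactus.html',
    ];

    // Anything not on the list falls back to the home page.
    return $pages[$requested] ?? $pages['home'];
}

// Usage in index.php:
// include resolve_page($_GET['page'] ?? 'home');
```

A request for "index.php?page=.htpasswd" then quietly gets the home page, as does any attempt to point at a file on another server.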

Using Defaults

When MySQL is installed, it uses a default username of "root" and blank password. SQL Server uses "sa" as the default user with a blank password. If someone finds the address of your database server and wants to try to log in, these are the first combinations they will try. If you have not set a different password (and ideally username as well) than the default, then you may well wake up one morning to find your database has been wiped and all your customers' credit card numbers stolen. The same applies to all software you use - if software comes with default username or password, change them.

Leaving Installation Files Online

Many PHP programs come with installation files. Many of these are self-deleting once run, and many applications will refuse to run until you delete the installation files. Many, however, will not pay the blindest bit of attention if the install files are still online. If they are still online, they may still be usable, and someone may be able to use them to overwrite your entire site.

Predictability

Let us imagine for a second that your site has attracted the attention of a Bad Person. This Bad Person wants to break in to your administration area, and change all of your product descriptions to "This Product Sucks". I would hazard a guess that their first step will be to go to http://www.yoursite.com/admin/ - just in case it exists. Placing your sensitive files and folders somewhere predictable like that makes life for potential hackers that little bit easier.

With this in mind, make sure you name your sensitive files and folders so that they are tough to guess. Placing your admin area at http://www.yoursite.com/jsfh8sfsifuhsi8392/ might make it harder to just type in quickly, but it adds an extra layer of security to your site. Pick something memorable by all means if you need an address you can remember quickly, but don't pick "admin" or "administration" (or your username or password). Pick something unusual.

The same applies to usernames and passwords. If you have an admin area, do not use "admin" as the username and "password" as the password. Pick something unusual, ideally with both letters and numbers (some hackers use something called a "dictionary attack", trying every word in a dictionary as a password until they find a word that works - adding a couple of digits to the end of a password makes this type of attack much less likely to succeed). It is also wise to change your password fairly regularly (every month or two).

Finally, make sure that your error messages give nothing away. If your admin area gives an error message saying "Unknown Username" when a bad username is entered and "Wrong Password" when the wrong password is entered, a malicious user will know when they've managed to guess a valid username. Using a generic "Login Error" error message for both of the above means that a malicious user will have no idea if it is the username or password he has entered that is wrong.
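A sketch of the generic error message idea follows. It assumes, purely for illustration, that users are held in an array mapping usernames to hashes created with PHP's password_hash(); the function name is my own.

```php
<?php
// $users maps usernames to hashes made with password_hash(); hypothetical storage.
function login_error(array $users, string $username, string $password): ?string
{
    $hash = $users[$username] ?? null;

    // An unknown username and a wrong password take the same branch,
    // so the visitor cannot tell which one they got wrong.
    if ($hash === null || !password_verify($password, $hash)) {
        return 'Login Error';
    }

    return null;   // null means the login succeeded
}
```

Both kinds of failure produce exactly the same text, which is the whole point.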

Finally, Be Completely and Utterly Paranoid

If you assume your site will never come under attack, or face any problems of any sort, then when something eventually does go wrong, you will be in massive amounts of trouble. If, on the other hand, you assume every single visitor to your site is out to get you and you are permanently at war, you will help yourself to keep your site secure, and be prepared in case things should go wrong.

Ready for more? Try Writing Secure PHP, Part 2.



Fri, 16 Jul 2004 10:07:15 +0100 http://www.addedbytes.com/articles/writing-secure-php/writing-secure-php-1/ Dave Child
Flesch-Kincaid Reading Level
http://www.addedbytes.com/blog/code/flesch-kincaid-function/

PLEASE NOTE: This code is now considered out of date. An updated version has been released under an open source license as a Google Code project: php-text-statistics. There is more about this change in the post Readability Code Open Sourced.

A tool for checking the readability scores of text is available - this article covers the functions behind that tool.

Calculations based upon word structure can tell you a fair bit about the text on your site, most notably the readability of your copy. A lot of sites have text on them that is simply too advanced for their users, which is as useful as having no text at all.

It is therefore usually a good idea to check the copy on your website as thoroughly as possible. Spelling and grammar should be checked as a matter of course. You should also check how difficult your text is to read. If a user cannot easily understand what they are reading, they will leave the site and find one they can comprehend.

The following are two calculations that can give you an indicator of how easy your text is to read.

Flesch-Kincaid Reading Ease

The Flesch-Kincaid reading ease score is worked out using the following calculation, which gives a number. The higher that number is, the easier the text is to read.

206.835 - (1.015 * average_words_sentence) - (84.6 * average_syllables_word)

The function you will need to use to work this score out (in addition to the three at the bottom of this page) is:

function calculate_flesch($text) {
    return (206.835 - (1.015 * average_words_sentence($text)) - (84.6 * average_syllables_word($text)));
}

And you can call the function like so:

$flesch_score = calculate_flesch($text);

Flesch-Kincaid Grade level

The Flesch-Kincaid grade level is a similar calculation, but gives a number that corresponds to the grade a person will need to have reached to understand the text. For example, a Grade level score of 8 means that an eighth grader will understand the text.

(.39 * average_words_sentence) + (11.8 * average_syllables_word) - 15.59

The function you will need to use to work this score out (in addition to the three at the bottom of this page) is:

function calculate_flesch_grade($text) {
    return ((.39 * average_words_sentence($text)) + (11.8 * average_syllables_word($text)) - 15.59);
}

And you can call the function like so:

$flesch_grade = calculate_flesch_grade($text);

Both of the functions above make use of the functions below, so these will need to be included in your scripts in order for either function to be used.

Each score returned is not perfectly accurate. Unfortunately, it is not always possible to work out the number of syllables in a word programmatically, and not always possible to correctly calculate the number of words per sentence, or indeed number of sentences, in text. However, the function will return a close approximation of the value - certainly good enough for our purposes.

Ideally, you should aim for a reading ease of around 60 to 70 (equivalent to a Grade level of around 6 to 8). The nearer 100 your text scores, the easier it is to read (and conversely, the lower the grade score, the easier the text is to read). Comics, for example, are usually in the 90s. The Harvard Law Review scores in the low 30s. Legal documents are usually lucky to make it into double figures.
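To see how the two scores relate, it helps to work the formulas through by hand. The sketch below applies them directly to a pair of averages; the figures of 15 words per sentence and 1.5 syllables per word are made up for illustration, and the function names are mine.

```php
<?php
// Both formulas take the two averages directly, so they can be checked by hand.
function reading_ease(float $wordsPerSentence, float $syllablesPerWord): float
{
    return 206.835 - (1.015 * $wordsPerSentence) - (84.6 * $syllablesPerWord);
}

function grade_level(float $wordsPerSentence, float $syllablesPerWord): float
{
    return (0.39 * $wordsPerSentence) + (11.8 * $syllablesPerWord) - 15.59;
}

// 206.835 - 15.225 - 126.9 = 64.71
echo round(reading_ease(15, 1.5), 2), "\n";
// 5.85 + 17.7 - 15.59 = 7.96
echo round(grade_level(15, 1.5), 2), "\n";
```

Text with those averages lands inside the target range on both scales: a reading ease of 64.71 and a Grade level of just under 8.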

The functions you will need in order to calculate the Flesch-Kincaid reading ease or Grade level of text are:

function average_words_sentence($text) {
    // Count sentence-ending punctuation; treat text with none as one sentence
    // so that we never divide by zero
    $sentences = max(1, strlen(preg_replace('/[^\.!?]/', '', $text)));
    // The number of spaces is used as an approximation of the number of words
    $words = strlen(preg_replace('/[^ ]/', '', $text));
    return ($words/$sentences);
}

function average_syllables_word($text) {
    $words = explode(' ', $text);
    $syllables = 0;
    for ($i = 0; $i < count($words); $i++) {
        $syllables = $syllables + count_syllables($words[$i]);
    }
    return ($syllables/count($words));
}

function count_syllables($word) {
    // Patterns that each subtract one syllable from the count
    $subsyl = Array(
        'cial', 'tia', 'cius', 'cious', 'giu', 'ion', 'iou', 'sia$', '.ely$'
    );
    // Patterns that each add one syllable to the count
    $addsyl = Array(
        'ia', 'riet', 'dien', 'iu', 'io', 'ii', '[aeiouym]bl$', '[aeiou]{3}',
        '^mc', 'ism$', '([^aeiouy])\1l$', '[^l]lien', '^coa[dglx].',
        '[^gq]ua[^auieo]', 'dnt$'
    );

    // Based on Greg Fast's Perl module Lingua::EN::Syllables
    $word = preg_replace('/[^a-z]/is', '', strtolower($word));
    $word_parts = preg_split('/[^aeiouy]+/', $word);
    $valid_word_parts = Array();
    foreach ($word_parts as $key => $value) {
        if ($value <> '') {
            $valid_word_parts[] = $value;
        }
    }

    $syllables = 0;
    // Thanks to Joe Kovar for correcting a bug in the following lines
    foreach ($subsyl as $syl) {
        $syllables -= preg_match('~'.$syl.'~', $word);
    }
    foreach ($addsyl as $syl) {
        $syllables += preg_match('~'.$syl.'~', $word);
    }
    if (strlen($word) == 1) {
        $syllables++;
    }
    $syllables += count($valid_word_parts);
    $syllables = ($syllables == 0) ? 1 : $syllables;
    return $syllables;
}

Examples

The following are two examples of text and the readability of that text.

The first is an excerpt from The Wind in the Willows (http://www.online-literature.com/grahame/windwillows/). It is what most people would call easy to read:

"There's Toad Hall," said the Rat; "and that creek on the left, where the notice-board says, 'Private. No landing allowed,' leads to his boat-house, where we'll leave the boat. The stables are over there to the right. That's the banqueting-hall you're looking at now - very old, that is. Toad is rather rich, you know, and this is really one of the nicest houses in these parts, though we never admit as much to Toad."

For reading ease, this scored 69. It also had a Grade Level of 7. This particular passage of The Wind in the Willows scores at almost exactly the level that web page text should ideally reach.

On the other hand, the following (both this text and the above were generously provided by [url=http://members.dca.net/slawski]Bill Slawski[/url], by the way) is an excerpt from a legal document, and would give many a headache:

The foregoing warranties by each party are in lieu of all other warranties, express or implied, with respect to this agreement, including but not limited to implied warranties of merchantability and fitness for a particular purpose. Neither party shall have any liability whatsoever for any cover or setoff nor for any indirect, consequential, exemplary, incidental or punitive damages, including lost profits, even if such party has been advised of the possibility of such damages.

This scores an incredible -1 on the reading ease scale. The Grade Level required to read it? 22. Text like this is about as unreadable as anything you could add to a web page.

These are, perhaps, extreme examples, but they should give an idea of the differences between good and bad text on a web page.



]]>
Wed, 07 Jul 2004 14:17:00 +0100 http://www.addedbytes.com/blog/code/flesch-kincaid-function/ Dave Child
Gunning-Fog Index http://www.addedbytes.com/blog/code/gunning-fog-function/ PLEASE NOTE: This code is now considered out of date. An updated version has been released under an open source license as a Google Code project: php-text-statistics. There is more about this change in the post Readability Code Open Sourced.

A tool for checking the readability scores of text is available - this article covers the functions behind that tool.

The Gunning-Fog index is a measure of text readability. It represents the approximate reading age of the text - the age someone will need to be to understand what they are reading.

The following is the algorithm to determine the Gunning-Fog index:

(average_words_sentence + percentage_of_words_with_three_or_more_syllables) * 0.4

The above produces a number, which is a rough measure of the age someone must be to understand the content. The lower the number, the more understandable the content will be to your visitors. Web sites should aim to have content that falls roughly in the 11-15 range for this test.
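As a quick check of the arithmetic, the formula can be applied to counts gathered by hand. This is only a sketch: the function name gunning_fog_from_counts and the sample counts are made up for illustration and are not part of the functions in this article. "Complex" words here are counted the same way as in the code in this article, that is, words with more than two syllables.

```php
<?php
// Hypothetical helper: applies the Gunning-Fog formula to counts you already have.
function gunning_fog_from_counts($word_count, $sentence_count, $complex_word_count) {
    if ($word_count < 1 || $sentence_count < 1) {
        return 0.0; // nothing to measure
    }
    $average_words_sentence = $word_count / $sentence_count;
    $percentage_complex = ($complex_word_count / $word_count) * 100;
    return ($average_words_sentence + $percentage_complex) * 0.4;
}

// Made-up sample: 100 words, 5 sentences, 10 words with more than two syllables.
// (100 / 5 + 10) * 0.4 = (20 + 10) * 0.4 = 12
echo gunning_fog_from_counts(100, 5, 10);
```

With those sample counts the score lands at about 12, inside the 11-15 range suggested above.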

Any number returned over the value of 22 can be taken to be just 22, and is roughly equivalent to post-graduate level.
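The cap described above could be applied with a small wrapper. This is a sketch rather than part of the original functions: the name capped_gunning_fog is hypothetical, and it takes an already-calculated score as its argument.

```php
<?php
// Hypothetical wrapper: treats any score above 22 as 22 (post-graduate level).
function capped_gunning_fog($score) {
    return min($score, 22);
}

echo capped_gunning_fog(27.6), "\n"; // above the cap, so reported as 22
echo capped_gunning_fog(12.4), "\n"; // below the cap, left unchanged
```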

Below is a selection of functions you can use to determine the Gunning-Fog index of text. To calculate it, all you need to do is call the function as follows, where $text is the text whose readability you wish to measure.

$gunning_fog_score = gunning_fog_score($text);

function gunning_fog_score($text) {
    return ((average_words_sentence($text) + percentage_number_words_three_syllables($text)) * 0.4);
}

function average_words_sentence($text) {
    // Approximates the counts: sentence terminators (. ! ?) and spaces between words
    $sentences = strlen(preg_replace('/[^\.!?]/', '', $text));
    $words = strlen(preg_replace('/[^ ]/', '', $text));
    // Avoid dividing by zero when the text has no sentence terminator
    $sentences = max(1, $sentences);
    return ($words/$sentences);
}

function percentage_number_words_three_syllables($text) {
    // Counts words with more than two syllables, as a rounded percentage of all words
    $syllables = 0;
    $words = explode(' ', $text);
    for ($i = 0; $i < count($words); $i++) {
        if (count_syllables($words[$i]) > 2) {
            $syllables++;
        }
    }
    // round() keeps the result numeric; number_format() would return a string
    $score = round(($syllables / count($words)) * 100);
    return ($score);
}

function count_syllables($word) {
    // Patterns that count as one syllable fewer than their vowel groups suggest
    $subsyl = Array(
        'cial', 'tia', 'cius', 'cious', 'giu', 'ion', 'iou', 'sia$', '.ely$'
    );
    // Patterns that count as one syllable more than their vowel groups suggest
    $addsyl = Array(
        'ia', 'riet', 'dien', 'iu', 'io', 'ii', '[aeiouym]bl$', '[aeiou]{3}',
        '^mc', 'ism$', '([^aeiouy])\1l$', '[^l]lien', '^coa[dglx].',
        '[^gq]ua[^auieo]', 'dnt$'
    );

    // Based on Greg Fast's Perl module Lingua::EN::Syllables
    $word = preg_replace('/[^a-z]/is', '', strtolower($word));
    $word_parts = preg_split('/[^aeiouy]+/', $word);
    $valid_word_parts = array(); // initialise so count() always receives an array
    foreach ($word_parts as $key => $value) {
        if ($value <> '') {
            $valid_word_parts[] = $value;
        }
    }

    $syllables = 0;
    // Thanks to Joe Kovar for correcting a bug in the following lines
    foreach ($subsyl as $syl) {
        $syllables -= preg_match('~'.$syl.'~', $word);
    }
    foreach ($addsyl as $syl) {
        $syllables += preg_match('~'.$syl.'~', $word);
    }
    if (strlen($word) == 1) {
        $syllables++;
    }
    $syllables += count($valid_word_parts);
    $syllables = ($syllables == 0) ? 1 : $syllables;
    return $syllables;
}

]]>
Tue, 06 Jul 2004 11:41:35 +0100 http://www.addedbytes.com/blog/code/gunning-fog-function/ Dave Child