Tagged with "validation" (http://www.addedbytes.com/feeds/tag-feed/) - Web Development in Brighton - Added Bytes

On-The-Fly Validation
http://www.addedbytes.com/blog/code/on-the-fly-validation/

Of all coding errors in websites (usually highlighted by code validators), a few crop up time and time again. These common coding bugs account for more than 90% of the mistakes in web sites. Despite being so prevalent, most designers still allow these basic errors to creep into their code.

Note: Many people may consider "validation errors" unimportant. However, if you are going to write a web page in a specific language, it makes sense to actually use that language properly, rather than making up your own random dialect. After all, can you be sure that dialect will be interpreted the same way every single time? And while many people find errors like this easy to ignore, they should remember that, although these errors might not stop a page being usable, what validators bring up are "coding errors" - mistakes in the markup of the page.

These scripts are intended to make life a touch easier for busy developers. Of course, these scripts will slow down your site, and are no substitute for actually writing valid code in the first place. They are intended to catch the occasional bug that you may have missed, or that may be introduced through a comments system, for example.

In order to make use of the following code, you will need to be using PHP 4 or higher on Apache. The following scripts make use of output buffering and work best with a caching system in place as well.

This script will not remove all of your validation errors. It cannot remove them all without running very slowly - there is a lot to check in each document. However, it can catch a few of the more common bugs that most designers miss at least once in a website.

To begin, the scripts we use start output buffering. This means that rather than sending the page to the user as it is created, the page is saved on the server until the server is told to output the page (or the script ends). This will allow us to modify page output without needing to worry about editing the PHP behind it. To start output buffering, you need to include the following code at the top of each page. You can include it using the "include()" or "require()" functions, or using PHP's superb auto_prepend_file directive, which can be set in an htaccess file (you can see it in use in this caching tutorial).

ob_start();

After the script has run, we need to include another script at the end to process the page and output it to the user. You can, again, use "include()" or "require()", or auto_append_file in htaccess to include this script.

The script itself runs in three steps. The first step, below, grabs the contents of the output buffer. This will create a variable called "$output" that contains the page we were about to send to the user. $output contains the HTML after all PHP has run as normal, so the variable contains exactly what the user would normally see. The second line trims whitespace from the start and end of the page, and the third line empties the output buffer (but does not send its contents to the user).

$output = ob_get_contents();
$output = trim($output);
ob_end_clean();
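
As a rough illustration of the buffering step on its own, the sketch below wraps it in a helper. The name capture_output() is hypothetical and not part of the original scripts, and the type declarations assume PHP 7 or later.

```php
<?php
// Hypothetical helper: run $render with buffering on and return what it printed.
function capture_output(callable $render): string
{
    ob_start();                   // collect output instead of sending it
    $render();                    // anything echoed here lands in the buffer
    $output = ob_get_contents();  // copy the buffer into a variable
    ob_end_clean();               // discard the buffer without sending it
    return trim($output);         // strip surrounding whitespace, as above
}

$page = capture_output(function () {
    echo "  <p>Hello</p>\n";
});
// $page can now be checked and modified before it is finally echoed.
```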

Now that $output contains the page we are about to send the user, it is time to run the various checks we want, to make sure there are no validation errors in place.

if ((strpos($output, "<!DOCTYPE") > strpos($output, "<html")) or (strpos($output, "<!DOCTYPE") === false)) {
    $output = str_replace('<html', "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\" \"http://www.w3.org/TR/html4/loose.dtd\">\n<html", $output);
}

First, we check for a DTD. These are important, as they tell the user agent (e.g. the browser) what language a page is written in. The above checks for the presence of a DTD before the <html> tag, and if it is missing it adds in the DTD for HTML 4.01 Transitional - probably the most common one in use today.
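
The same check can be sketched as a small function that handles the "not found" case from strpos() explicitly. This is only an illustration: the name ensure_doctype() is hypothetical, it matches case-insensitively with stripos() (unlike the snippet above), and it inserts the DTD before the first <html only.

```php
<?php
// Hypothetical helper: add the HTML 4.01 Transitional DTD if none precedes <html.
function ensure_doctype(string $html): string
{
    $doctype = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" '
             . '"http://www.w3.org/TR/html4/loose.dtd">';

    $htmlPos    = stripos($html, '<html');
    $doctypePos = stripos($html, '<!DOCTYPE');

    if ($htmlPos === false) {
        return $html;   // no <html tag to anchor on, leave the page alone
    }
    if ($doctypePos !== false && $doctypePos < $htmlPos) {
        return $html;   // a DTD is already in place before <html
    }
    // Insert the DTD on its own line directly before the <html tag.
    return substr($html, 0, $htmlPos) . $doctype . "\n" . substr($html, $htmlPos);
}
```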

function encode_chars($text) {
    $text = str_replace("<", "&lt;", $text);

    $tag_list = '((\/?)(!DOCTYPE|!--|a(bbr|cronym|ddress|pplet|rea)?|b(ase(font)?|do|ig|lockquote|ody|r|utton)?|c(aption|enter|ite|(o(de|l(group)?)))|d(d|el|fn|i(r|v)|l|t)|em|f(ieldset|o(nt|rm)|rame(set)?)|h(1|2|3|4|5|6|ead|r|tml)|i(frame|mg|n(put|s)|sindex)?|kbd|l(abel|egend|i(nk)?)|m(ap|e(nu|ta))|no(frames|script)|o(bject|l|pt(group|ion))|p(aram|re)?|q|s(amp|cript|elect|mall|pan|t(r(ike|ong)|yle)|u(b|p))|t(able|body|d|extarea|foot|h|itle|r|t)|u(l)?|var)([^>]*))';

    $text = preg_replace("/(&lt;)" . $tag_list . "(>)/mi", "<$2>", $text);
    $text = preg_replace("/(>[^<]*)>/mi", "$1&gt;", $text);
    $text = str_replace("/>", ">", $text);

    return $text;
}

$output = encode_chars($output);

Next, we run a function on the script to check the tags on the page. Any tags that don't belong there are encoded so they are displayed rather than processed. We use the HTML 4.01 tag list, which means we will catch the worst of the invalid tags.
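
To see the idea on a smaller scale, here is a minimal sketch that keeps a short list of allowed elements and encodes everything else. The function name encode_unknown_tags() and the $allowed parameter are assumptions made for illustration, it is not the encode_chars() function above, and it assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: $allowed is a list of lower-case element names to keep.
function encode_unknown_tags(string $html, array $allowed): string
{
    return preg_replace_callback(
        '/<(\/?)([A-Za-z][A-Za-z0-9]*)([^<>]*)>/',
        function (array $m) use ($allowed) {
            if (in_array(strtolower($m[2]), $allowed, true)) {
                return $m[0];   // known element: leave it untouched
            }
            // Unknown element: encode the brackets so it is displayed as text.
            return '&lt;' . $m[1] . $m[2] . $m[3] . '&gt;';
        },
        $html
    );
}

echo encode_unknown_tags('<p><blink>hi</blink></p>', array('p'));
// prints <p>&lt;blink&gt;hi&lt;/blink&gt;</p>
```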

$output = preg_replace("/<img([^>]*)alt=([^>]*)>/im", "<img$1`alt=$2>", $output);
$output = preg_replace("/<img([^`|>]*)>/im", "<img alt=\" \"$1>", $output);
$output = preg_replace("/<img([^>]*)`alt=([^>]*)>/im", "<img$1alt=$2>", $output);

This small snippet of code checks for alt attributes on images. If they are missing, it adds a single space as an alt attribute. This is by no means optimal (and the regex and technique is ugly - if anyone can improve on this, please give me a shout!), however does mean that if an alt attribute is missed, screen readers will not simply give the name of the image file. You should always take the greatest care to ensure that all images have appropriate alt attributes.
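
One possible way to do the same job in a single pass is a negative lookahead, which only matches image tags that contain no alt attribute. This is a sketch rather than a drop-in replacement; the function name add_missing_alt() is hypothetical.

```php
<?php
// Hypothetical helper: give every <img> without an alt attribute alt=" ".
function add_missing_alt(string $html): string
{
    // (?![^>]*\balt\s*=) fails if "alt=" appears anywhere before the closing ">".
    return preg_replace(
        '/<img\b(?![^>]*\balt\s*=)([^>]*)>/i',
        '<img alt=" "$1>',
        $html
    );
}

echo add_missing_alt('<img src="logo.png"><img src="a.png" alt="A">');
// prints <img alt=" " src="logo.png"><img src="a.png" alt="A">
```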

Next, we do a little language-specific work. In the above code, we removed all closing slashes (e.g. in a <br /> tag). Now, if we are using XHTML, we add them back in for the appropriate elements. We also check the case of elements if using XHTML, as tags and attributes must be lower case in XHTML. This will only affect attributes whose values are quoted.

function process_attributes($text) {
    return preg_replace("/ ([a-z]+)=\"([^( |\")]*)\"/mie", "' ' . strtolower('$1') . '=\"' . stripslashes('$2') . '\"'", $text);
}

if (strpos($output, "//W3C//DTD XHTML") !== false) {
    $output = preg_replace("/<(img|hr|meta|link|br|base|frame|input)([^>]*)>/mi", "<$1$2 />", $output);
    $output = preg_replace("/<(\/?)([a-z]+)( |>)/mie", "'<$1' . strtolower('$2') . '$3'", $output);
    $output = preg_replace("/<([^>]+)>/mie", "'<'.process_attributes(stripslashes('$1')).'>'", $output);
    $output = preg_replace("/&(?!#?[xX]?(?:[0-9a-fA-F]+|\w{1,8});)/i", "&amp;", $output);
}

We also, at the end, encode any ampersands that should be encoded. Many thanks to Shaun Inman (http://www.shauninman.com) for the last line.
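
The "e" modifier used in the patterns above was deprecated in PHP 5.5 and removed in PHP 7. On newer versions, the same lower-casing can be sketched with preg_replace_callback(), shown here next to the ampersand fix. The function names lowercase_markup() and encode_bare_ampersands() are hypothetical, and the tag pattern is slightly stricter than the one above.

```php
<?php
// Hypothetical helper: lower-case tag names and quoted attribute names.
function lowercase_markup(string $html): string
{
    // Lower-case element names in opening and closing tags.
    $html = preg_replace_callback(
        '/<(\/?)([a-z][a-z0-9]*)(?=[\s>\/])/i',
        function (array $m) { return '<' . $m[1] . strtolower($m[2]); },
        $html
    );
    // Lower-case the names of double-quoted attributes inside each tag.
    return preg_replace_callback(
        '/<[^>]+>/',
        function (array $tag) {
            return preg_replace_callback(
                '/ ([a-z]+)="/i',
                function (array $a) { return ' ' . strtolower($a[1]) . '="'; },
                $tag[0]
            );
        },
        $html
    );
}

// Hypothetical helper: encode ampersands that do not start an entity.
function encode_bare_ampersands(string $html): string
{
    return preg_replace('/&(?!#?[xX]?(?:[0-9a-fA-F]+|\w{1,8});)/', '&amp;', $html);
}

echo lowercase_markup('<A HREF="Index.html" Title="Home">Go</A>');
// prints <a href="Index.html" title="Home">Go</a>
```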

Finally, we need to send the processed output to the user.

$output = str_replace("<b>", "<strong>", $output);
$output = str_replace("<i>", "<em>", $output);
$output = str_replace("</b>", "</strong>", $output);
$output = str_replace("</i>", "</em>", $output);
echo $output;

At this stage, the code sent to the user will have a valid Document Type Definition. All tags will be correctly closed whether using HTML or XHTML. All images will have alt attributes. If we're using XHTML, all tags and attributes will be lower case (as long as the attributes are quoted). All invalid opening and closing tags will have been encoded. All ampersands should be properly encoded. And for good measure, we've replaced all bold (<b>) and italic (<i>) tags with the proper <strong> and <em> tags.

If you put it all together, you get the following code to be included at the end of each script:

<?php

$output = ob_get_contents();
$output = trim($output);
ob_end_clean();
    
function process_attributes($text) {
    return preg_replace("/ ([a-z]+)=\"([^( |\")]*)\"/mie", "' ' . strtolower('$1') . '=\"' . stripslashes('$2') . '\"'", $text);
}

function encode_chars($text) {
    $text = str_replace("<", "&lt;", $text);

    $tag_list = '((\/?)(!DOCTYPE|!--|a(bbr|cronym|ddress|pplet|rea)?|b(ase(font)?|do|ig|lockquote|ody|r|utton)?|c(aption|enter|ite|(o(de|l(group)?)))|d(d|el|fn|i(r|v)|l|t)|em|f(ieldset|o(nt|rm)|rame(set)?)|h(1|2|3|4|5|6|ead|r|tml)|i(frame|mg|n(put|s)|sindex)?|kbd|l(abel|egend|i(nk)?)|m(ap|e(nu|ta))|no(frames|script)|o(bject|l|pt(group|ion))|p(aram|re)?|q|s(amp|cript|elect|mall|pan|t(r(ike|ong)|yle)|u(b|p))|t(able|body|d|extarea|foot|h|itle|r|t)|u(l)?|var)([^>]*))';

    $text = preg_replace("/(&lt;)" . $tag_list . "(>)/mi", "<$2>", $text);
    $text = preg_replace("/(>[^<]*)>/mi", "$1&gt;", $text);
    $text = str_replace("/>", ">", $text);

    return $text;
}

if ((strpos($output, "<!DOCTYPE") > strpos($output, "<html")) or (strpos($output, "<!DOCTYPE") === false)) {
    $output = str_replace('<html', "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\" \"http://www.w3.org/TR/html4/loose.dtd\">\n<html", $output);
}

$output = encode_chars($output);
$output = preg_replace("/<img([^>]*)alt=([^>]*)>/im", "<img$1`alt=$2>", $output);
$output = preg_replace("/<img([^`|>]*)>/im", "<img alt=\" \"$1>", $output);
$output = preg_replace("/<img([^>]*)`alt=([^>]*)>/im", "<img$1alt=$2>", $output);

if (strpos($output, "//W3C//DTD XHTML") !== false) {
    $output = preg_replace("/<(img|hr|meta|link|br|base|frame|input)([^>]*)>/mi", "<$1$2 />", $output);
    $output = preg_replace("/<(\/?)([a-z]+)( |>)/mie", "'<$1' . strtolower('$2') . '$3'", $output);
    $output = preg_replace("/<([^>]+)>/mie", "'<'.process_attributes(stripslashes('$1')).'>'", $output);
    $output = preg_replace("/&(?!#?[xX]?(?:[0-9a-fA-F]+|\w{1,8});)/i", "&amp;", $output);
}

$output = str_replace("<b>", "<strong>", $output);
$output = str_replace("<i>", "<em>", $output);
$output = str_replace("</b>", "</strong>", $output);
$output = str_replace("</i>", "</em>", $output);
echo $output;

?>

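As an alternative to a separate appended script, ob_start() also accepts a callback that receives the buffered page and returns the text to send. The sketch below shows only the final tag replacements done this way; the callback name tidy_output() is hypothetical, and the other checks could be added inside it in the same manner.

```php
<?php
// Hypothetical callback: receives the whole buffer, returns what should be sent.
function tidy_output(string $buffer): string
{
    return str_replace(
        array('<b>', '</b>', '<i>', '</i>'),
        array('<strong>', '</strong>', '<em>', '</em>'),
        trim($buffer)
    );
}

ob_start('tidy_output');   // in the prepended script
echo " <p><b>Hello</b> <i>world</i></p> ";
ob_end_flush();            // prints <p><strong>Hello</strong> <em>world</em></p>
```

With this approach the processing runs when the buffer is flushed, including at the end of the script, so no appended file is needed.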
Thu, 01 Jul 2004 13:40:00 +0100 http://www.addedbytes.com/blog/code/on-the-fly-validation/ Dave Child

Email Address Validation
http://www.addedbytes.com/blog/code/email-address-validation/

PLEASE NOTE: This function is now considered out of date. An updated version incorporating many of the comments below has been released under an open source license as a Google Code project: php-email-address-validation. There is more about this change in the post Email Address Validation Updated.

Many email address validators will actually throw up errors when faced with a valid, but unusual, email address. Many, for example, assume that an email address with a domain name extension of more than three letters is invalid. However, new TLDs such as ".info", ".name" and ".aero" are perfectly valid but longer than three characters. Many email address validators fail to take into account that you do not necessarily need a domain name in an email address - an IP address is fine.

The first step in creating a PHP script for validating email addresses is to work out exactly what is and is not valid. RFC 2822, which specifies what is and is not allowed in an email address, states that an email address must be of the form "local-part @ domain".

The "local-part" of an email address must be between 1 and 64 characters in length and may be made up in any one of three ways. It can be made up of a selection of characters (and only these characters) from the following selection (though the period can not be the first of these):

  • A to Z
  • 0 to 9
  • !
  • #
  • $
  • %
  • &
  • '
  • *
  • +
  • -
  • /
  • =
  • ?
  • ^
  • _
  • `
  • {
  • |
  • }
  • ~
  • .

Or, it can be made up of a quoted string containing any characters except "\". Older email addresses may be made up differently, and may contain a combination of the above. The following are all valid as the first part of an email address:

  • dave
  • +1~1+
  • {_dave_}
  • ""
  • dave."dave" (Note that this is considered an obsolete form of address - new addresses created should not be of this form, but it is still considered valid.)

The following, though similar, are all invalid:

  • -- dave -- (spaces are invalid unless enclosed in quotation marks)
  • [dave] (square brackets are invalid, unless contained within quotation marks)
  • .dave (the local part of an email address cannot start with a period)

The "domain" portion of the email address can also be made up in different ways. The most common form is a domain name, which is made up of a number of "labels", each separated by a period and between 1 and 63 characters in length. Labels may contain letters, digits and hyphens, but must not begin or end with a hyphen. (Officially, a label must begin with a letter, not a digit. However, many domain names have been registered beginning with digits, so for the purposes of validation we will assume that digits are allowed at the start of domain names.) A domain name, technically, need be only one label. In practice, however, domain names are made up of at least two labels, so for the purposes of validation we will check for two. A domain name may not be over 255 characters in total. The domain portion of an email address may also be an IP address, which can in turn be enclosed in square brackets.

In order to check that email addresses conform to these guidelines, we'll need to use regular expressions. First, we need to match the three possible forms of the local part of an email address, using the two patterns below (we'll add in escape characters later, when we put the function together):

^[A-Za-z0-9!#$%&'*+/=?^_`{|}~-][A-Za-z0-9!#$%&'*+/=?^_`{|}~\.-]{0,63}$
^"[^(\|")]{0,62}"$

We can use the two patterns we've defined here to check for obsolete local parts of email addresses too: the function splits the local part on periods and checks each piece against these patterns, which saves us from needing a third pattern.
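
As a quick way to try the local-part rules against the examples listed earlier, here is a sketch using preg_match(). The function name local_part_is_valid() is hypothetical, the hyphen sits at the end of each character class so it is read literally, and the quoted form is simplified to exclude only backslashes and double quotes.

```php
<?php
// Hypothetical helper: test a local part against the unquoted and quoted forms.
function local_part_is_valid(string $local): bool
{
    // Unquoted form: 1 to 64 allowed characters, not starting with a period.
    $plain  = '@^[A-Za-z0-9!#$%&\'*+/=?^_`{|}~-][A-Za-z0-9!#$%&\'*+/=?^_`{|}~.-]{0,63}$@D';
    // Quoted form: up to 62 characters between quotes (simplified, see above).
    $quoted = '@^"[^\\\\"]{0,62}"$@D';

    return preg_match($plain, $local) === 1 || preg_match($quoted, $local) === 1;
}

var_dump(local_part_is_valid('{_dave_}'));   // bool(true)
var_dump(local_part_is_valid('.dave'));      // bool(false)
```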

Next, we need to check the domain portion of the email address. It can either be an IP address or a domain name, so we can use the two patterns here to validate it:

^\[?[0-9\.]+\]?$
^[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?(\.[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?)+$

The second pattern will match any valid domain name of two or more labels, and will also match an IP address written without brackets. The first pattern is only strictly needed for IP addresses enclosed in square brackets.
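
The domain rules can be sketched as a standalone check in the same way. The function name domain_is_valid() is hypothetical, and the label pattern here allows single-character labels, which is an assumption made so that addresses such as 10.0.0.1 pass the name pattern as well.

```php
<?php
// Hypothetical helper: accept a bracketed or plain IP, or a name of 2+ labels.
function domain_is_valid(string $domain): bool
{
    if (strlen($domain) > 255) {
        return false;   // a domain name may not exceed 255 characters
    }
    $ip   = '/^\[?[0-9.]+\]?$/D';
    // Each label starts and ends with a letter or digit; hyphens only inside.
    $name = '/^[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?'
          . '(\.[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?)+$/D';

    return preg_match($ip, $domain) === 1 || preg_match($name, $domain) === 1;
}

var_dump(domain_is_valid('example.com'));      // bool(true)
var_dump(domain_is_valid('localhost'));        // bool(false)
```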

Putting it all together gives us the following function. Call it like any normal function, and you will get back a value of "true" if the string entered is a valid email address, or "false" if the input was an invalid email address.

function check_email_address($email) {
    // First, we check that there's one @ symbol, and that the lengths are right
    if (!ereg("^[^@]{1,64}@[^@]{1,255}$", $email)) {
        // Email invalid because wrong number of characters in one section, or wrong number of @ symbols.
        return false;
    }
    // Split it into sections to make life easier
    $email_array = explode("@", $email);
    $local_array = explode(".", $email_array[0]);
    for ($i = 0; $i < sizeof($local_array); $i++) {
        if (!ereg("^(([A-Za-z0-9!#$%&'*+/=?^_`{|}~-][A-Za-z0-9!#$%&'*+/=?^_`{|}~\.-]{0,63})|(\"[^(\\|\")]{0,62}\"))$", $local_array[$i])) {
            return false;
        }
    }
    if (!ereg("^\[?[0-9\.]+\]?$", $email_array[1])) { // Check if domain is IP. If not, it should be valid domain name
        $domain_array = explode(".", $email_array[1]);
        if (sizeof($domain_array) < 2) {
            return false; // Not enough parts to domain
        }
        for ($i = 0; $i < sizeof($domain_array); $i++) {
            if (!ereg("^(([A-Za-z0-9][A-Za-z0-9-]{0,61}[A-Za-z0-9])|([A-Za-z0-9]+))$", $domain_array[$i])) {
                return false;
            }
        }
    }
    return true;
}

Using the function above is relatively simple, as you can see:

if (check_email_address($email)) {
    echo $email . ' is a valid email address.';
} else {
    echo $email . ' is not a valid email address.';
}

You can now validate email addresses entered into your site against the specifications that define email addresses (more or less - domain names that start with a number are supposed to be invalid, but do exist).
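
The ereg() family of functions was removed in PHP 7, so on newer versions the same checks need preg_match(). The following is a rough port under that assumption, not the updated project mentioned at the top of this post. The name check_email_address_pcre() is hypothetical, the quoted form excludes only backslashes and double quotes, and labels are capped at 63 characters.

```php
<?php
// Hypothetical port of check_email_address() using preg_match().
function check_email_address_pcre(string $email): bool
{
    // Exactly one @, local part of 1-64 characters, domain of 1-255 characters.
    if (!preg_match('/^[^@]{1,64}@[^@]{1,255}$/D', $email)) {
        return false;
    }
    list($local, $domain) = explode('@', $email);

    // Each period-separated piece of the local part is plain or quoted.
    $plain  = '[A-Za-z0-9!#$%&\'*+\/=?^_`{|}~-]+';
    $quoted = '"[^\\\\"]{0,62}"';
    foreach (explode('.', $local) as $piece) {
        if (!preg_match('/^(?:' . $plain . '|' . $quoted . ')$/D', $piece)) {
            return false;
        }
    }

    // A run of digits and periods, optionally bracketed, is treated as an IP.
    if (preg_match('/^\[?[0-9.]+\]?$/D', $domain)) {
        return true;
    }

    // Otherwise the domain must be a name made of at least two valid labels.
    $labels = explode('.', $domain);
    if (count($labels) < 2) {
        return false;
    }
    $label = '/^(?:[A-Za-z0-9][A-Za-z0-9-]{0,61}[A-Za-z0-9]|[A-Za-z0-9])$/D';
    foreach ($labels as $part) {
        if (!preg_match($label, $part)) {
            return false;
        }
    }
    return true;
}
```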

Finally, please do remember that just because an email address looks valid does not mean it is in use. Using a script for validating email addresses is a good start to email address validation, but though it can tell you an email address is technically valid, it cannot tell you if it is in use. You might benefit from checking in more depth, for example seeing if a domain name is registered. Even better, fire off an email to the address given by a user and get them to click a link to confirm it is real - the only way to be 100% sure.
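
One way to check in more depth is a DNS lookup on the domain using PHP's checkdnsrr() function. The sketch below is only a starting point: the helper names email_domain() and domain_has_mail_records() are hypothetical, the lookup needs network access, and a positive result still does not prove the mailbox exists.

```php
<?php
// Hypothetical helper: return the part after the last @, or null if there is none.
function email_domain(string $email): ?string
{
    $at = strrpos($email, '@');
    if ($at === false || $at === strlen($email) - 1) {
        return null;
    }
    return substr($email, $at + 1);
}

// Hypothetical helper: true if DNS lists an MX or A record for the domain.
function domain_has_mail_records(string $domain): bool
{
    return checkdnsrr($domain, 'MX') || checkdnsrr($domain, 'A');
}

$domain = email_domain('dave@example.com');   // "example.com"
// if ($domain !== null && domain_has_mail_records($domain)) { ... }
```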


Tue, 01 Jun 2004 13:16:31 +0100 http://www.addedbytes.com/blog/code/email-address-validation/ Dave Child