On-The-Fly Validation

A tool to help automatically fix most common (X)HTML errors before outputting a page to the user.

Of all the coding errors in websites (usually highlighted by code validators), there are a few that crop up time and time again. These common coding bugs account for more than 90% of the mistakes in websites. Despite their being so prevalent, most designers still allow these basic errors to creep into their code.

Note: Many people may consider "validation errors" unimportant. However, if you are going to write a web page in a specific language, it makes sense to actually use that language properly rather than making up your own random dialect. After all, can you be sure that dialect will be interpreted the same way every single time? And while many people find errors like this easy to ignore, they should remember that, although the errors might not stop a page being usable, what validators bring up are "coding errors" - mistakes in the markup of the page.

These scripts are intended to make life a touch easier for busy developers. Of course, these scripts will slow down your site, and are no substitute for actually writing valid code in the first place. They are intended to catch the occasional bug that you may have missed, or that may be introduced through a comments system, for example.

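In order to make use of the following code, you will need to be using PHP 4 or higher on Apache. Note that several of the regular expressions rely on the "e" modifier, which was deprecated in PHP 5.5 and removed in PHP 7, so the scripts run unchanged only on versions older than that. The following scripts make use of output buffering and work best with a caching system in place as well.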

This script will not remove all of your validation errors. It cannot remove them all without running very slowly - there is a lot to check in each document. However, it can catch a few of the more common bugs that most designers miss at least once in a website.

To begin, the scripts we use start output buffering. This means that rather than sending the page to the user as it is created, the page is held on the server until the server is told to output it (or the script ends). This allows us to modify the page output without needing to edit the PHP behind it. To start output buffering, include the following code at the top of each page. You can include it using the "include()" or "require()" functions, or using PHP's superb auto_prepend_file directive, which can be set from an .htaccess file (you can see it in use in this caching tutorial).

ob_start();

After the script has run, we need to include another script at the end to process the page and output it to the user. You can, again, use "include()" or "require()", or set the auto_append_file directive in .htaccess to include this script.

The script itself runs in three steps. The first step, below, grabs the contents of the output buffer. This creates a variable called "$output" that contains the page we were about to send to the user. $output contains the HTML after all PHP has run as normal, so the variable only contains what the user would normally see. The second line trims any leading and trailing whitespace, and the third empties the output buffer (but does not send its contents to the user).

$output = ob_get_contents();
$output = trim($output);
ob_end_clean();
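
As a rough, self-contained illustration of the capture step, the sketch below buffers a tiny hard-coded page (a stand-in for whatever your own scripts would echo) and ends up with it in $output, exactly as the three lines above do.

```php
<?php
// Start buffering: nothing echoed below is sent to the client yet.
ob_start();

// Stand-in for the real page; note the stray whitespace around it.
echo "  <html><body>Hello</body></html>\n";

// Copy the buffered page into a string and trim it...
$output = ob_get_contents();
$output = trim($output);

// ...then discard the buffer without sending it.
ob_end_clean();

// $output can now be modified freely before it is finally echoed.
echo $output;
```

Nothing reaches the user until the final echo, which is what gives us the chance to tidy the markup first.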

Now that $output contains the page we are about to send to the user, it is time to run the various checks we want, to make sure there are no validation errors in place.

if ((strpos($output, "<!DOCTYPE") > strpos($output, "<html")) or (strpos($output, "<!DOCTYPE") === false)) {
    $output = str_replace('<html', "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\" \"http://www.w3.org/TR/html4/loose.dtd\">\n<html", $output);
}

First, we check for a DOCTYPE declaration. These are important, as they tell the user agent (e.g. the browser) what language a page is written in. The above checks for the presence of a DOCTYPE before the <html> tag, and if it is missing it adds in the one for HTML 4.01 Transitional - probably the most common one in use today. Note that the check is case-sensitive, so it expects "<!DOCTYPE" and "<html" exactly as written.
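
If you would rather have the DOCTYPE check in one place, something along these lines might do. It is only a sketch: the function name ensure_doctype() is made up for this example, and it uses stripos(), which needs PHP 5 or later, to make the comparison case-insensitive. Unlike the str_replace() version, it touches only the first <html tag it finds.

```php
<?php
// Hypothetical helper (not part of the original script): prepend an
// HTML 4.01 Transitional DOCTYPE when none appears before <html>.
function ensure_doctype($html) {
    $doctype = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" '
             . '"http://www.w3.org/TR/html4/loose.dtd">';

    $html_pos = stripos($html, '<html');
    if ($html_pos === false) {
        return $html; // No <html> tag to anchor on; leave the page alone.
    }

    $doctype_pos = stripos($html, '<!DOCTYPE');
    if ($doctype_pos !== false && $doctype_pos < $html_pos) {
        return $html; // A DOCTYPE already precedes <html>.
    }

    // Insert the DOCTYPE immediately before the first <html> tag.
    return substr($html, 0, $html_pos) . $doctype . "\n" . substr($html, $html_pos);
}

$fixed = ensure_doctype('<html><body></body></html>');
```

A page that already starts with a DOCTYPE comes back unchanged, as does a fragment with no <html> tag at all.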

function encode_chars($text) {
    $text = str_replace("<", "&lt;", $text);

    $tag_list = '((\/?)(!DOCTYPE|!--|a(bbr|cronym|ddress|pplet|rea)?|b(ase(font)?|do|ig|lockquote|ody|r|utton)?|c(aption|enter|ite|(o(de|l(group)?)))|d(d|el|fn|i(r|v)|l|t)|em|f(ieldset|o(nt|rm)|rame(set)?)|h(1|2|3|4|5|6|ead|r|tml)|i(frame|mg|n(put|s)|sindex)?|kbd|l(abel|egend|i(nk)?)|m(ap|e(nu|ta))|no(frames|script)|o(bject|l|pt(group|ion))|p(aram|re)?|q|s(amp|cript|elect|mall|pan|t(r(ike|ong)|yle)|u(b|p))|t(able|body|d|extarea|foot|h|itle|r|t)|u(l)?|var)([^>]*))';

    $text = preg_replace("/(&lt;)" . $tag_list . "(>)/mi", "<$2>", $text);
    $text = preg_replace("/(>[^<]*)>/mi", "$1&gt;", $text);
    $text = str_replace("/>", ">", $text);

    return $text;
}

$output = encode_chars($output);

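Next, we run a function on the output to check the tags on the page. Every "<" is encoded first, and then only tags whose names begin with an entry from the HTML 4.01 tag list are decoded again, so any tags that don't belong there are displayed rather than processed. This means we will catch the worst of the invalid tags. The function also encodes stray ">" characters and strips the closing slash from self-closing tags (e.g. <br />).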
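
To see the whitelist idea on a small scale, here is a rough sketch that takes a different route from encode_chars(): instead of encoding everything and decoding the known tags, it visits each tag with preg_replace_callback() and encodes the ones it does not recognise. The function name and the five-tag list are invented for the example, and the closure needs PHP 5.3 or later.

```php
<?php
// Hypothetical helper with a deliberately tiny whitelist; the article's
// own function uses the full HTML 4.01 tag list instead.
function encode_unknown_tags($html, $allowed = array('p', 'em', 'strong', 'a', 'br')) {
    return preg_replace_callback(
        '/<(\/?)([a-zA-Z][a-zA-Z0-9]*)([^>]*)>/',
        function ($m) use ($allowed) {
            // $m[1] is the optional slash, $m[2] the tag name,
            // $m[3] whatever follows the name (attributes and so on).
            if (in_array(strtolower($m[2]), $allowed, true)) {
                return $m[0]; // Known tag: leave it untouched.
            }
            // Unknown tag: encode the brackets so it is shown as text.
            return '&lt;' . $m[1] . $m[2] . $m[3] . '&gt;';
        },
        $html
    );
}

$sample = encode_unknown_tags('<p>Hi <blink>there</blink></p>');
// $sample is now '<p>Hi &lt;blink&gt;there&lt;/blink&gt;</p>'
```

Because the tag name is compared as a whole, a made-up tag cannot slip through just by starting with the same letters as a real one.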

$output = preg_replace("/<img([^>]*)alt=([^>]*)>/im", "<img$1`alt=$2>", $output);
$output = preg_replace("/<img([^`|>]*)>/im", "<img alt=\" \"$1>", $output);
$output = preg_replace("/<img([^>]*)`alt=([^>]*)>/im", "<img$1alt=$2>", $output);

This small snippet of code checks for alt attributes on images. If they are missing, it adds a single space as an alt attribute. This is by no means optimal (and the regex and technique are ugly - if anyone can improve on this, please give me a shout!), but it does mean that if an alt attribute is missed, screen readers will not simply read out the name of the image file. You should always take the greatest care to ensure that all images have appropriate alt attributes.
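
One possible tidy-up, offered as a sketch rather than a drop-in replacement: a single pass with preg_replace_callback() that inspects each <img> tag and adds the placeholder alt only when none is present. The name add_missing_alt() is hypothetical, and the closure needs PHP 5.3 or later.

```php
<?php
// Hypothetical helper: add alt=" " to any <img> tag that lacks an alt
// attribute, in one pass and without the temporary backtick marker.
function add_missing_alt($html) {
    return preg_replace_callback(
        '/<img\b([^>]*)>/i',
        function ($m) {
            // $m[1] holds everything between "<img" and ">".
            if (preg_match('/\salt\s*=/i', $m[1])) {
                return $m[0]; // Already has an alt attribute.
            }
            return '<img alt=" "' . $m[1] . '>';
        },
        $html
    );
}

$with_alt = add_missing_alt('<img src="a.png"><img src="b.png" alt="B">');
// $with_alt is now '<img alt=" " src="a.png"><img src="b.png" alt="B">'
```

It is still a regex working on markup, so an attribute value that happens to contain " alt=" would fool it, but it avoids rewriting every image tag three times.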

Next, we do a little language-specific work. In the above code, we removed all closing slashes (e.g. in a <br /> tag). Now, if we are using XHTML, we add them back in for the appropriate elements. We also check the case of elements if using XHTML, as tags and attributes must be lower case in XHTML. This will only affect attributes whose values are quoted.

function process_attributes($text) {
    return preg_replace("/ ([a-z]+)=\"([^( |\")]*)\"/mie", "' ' . strtolower('$1') . '=\"' . stripslashes('$2') . '\"'", $text);
}

if (strpos($output, "//W3C//DTD XHTML") !== false) {
    $output = preg_replace("/<(img|hr|meta|link|br|base|frame|input)([^>]*)>/mi", "<$1$2 />", $output);
    $output = preg_replace("/<(\/?)([a-z]+)( |>)/mie", "'<$1' . strtolower('$2') . '$3'", $output);
    $output = preg_replace("/<([^>]+)>/mie", "'<'.process_attributes(stripslashes('$1')).'>'", $output);
    $output = preg_replace("/&(?!#?[xX]?(?:[0-9a-fA-F]+|\w{1,8});)/i", "&amp;", $output);
}

At the end, we also encode any ampersands that are not already part of an entity. Many thanks to Shaun Inman (http://www.shauninman.com) for the last line.
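
The "e" modifier used in the XHTML block no longer exists in PHP 7 and later. On newer versions, the same two ideas could be sketched with preg_replace_callback() as below. Both function names are made up for this example, the closure needs PHP 5.3 or later, and only the tag-name and ampersand steps are covered, not the attribute handling.

```php
<?php
// Hypothetical helper: lower-case element names without the "e" modifier.
function lowercase_tag_names($html) {
    return preg_replace_callback(
        '/<(\/?)([a-z][a-z0-9]*)(?=[\s>\/])/i',
        function ($m) {
            // $m[1] is the optional closing slash, $m[2] the tag name.
            return '<' . $m[1] . strtolower($m[2]);
        },
        $html
    );
}

// Hypothetical helper: encode ampersands that do not already begin an
// entity, using the same lookahead as the last line of the block above.
function encode_bare_ampersands($html) {
    return preg_replace('/&(?!#?[xX]?(?:[0-9a-fA-F]+|\w{1,8});)/', '&amp;', $html);
}

$lower = lowercase_tag_names('<DIV><P>Hi</P></DIV>');
$amps  = encode_bare_ampersands('Tom & Jerry &amp; co &#169;');
// $lower is '<div><p>Hi</p></div>'
// $amps  is 'Tom &amp; Jerry &amp; co &#169;'
```

Existing entities such as &amp; and &#169; are left alone, so running the ampersand step twice does no harm.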

Finally, we replace the presentational <b> and <i> tags and send the processed output to the user.

$output = str_replace("<b>", "<strong>", $output);
$output = str_replace("<i>", "<em>", $output);
$output = str_replace("</b>", "</strong>", $output);
$output = str_replace("</i>", "</em>", $output);
echo $output;

At this stage, the code sent to the user will have a valid Document Type Definition. All tags will be correctly closed whether using HTML or XHTML. All images will have alt attributes. If we're using XHTML, all tags and attributes will be lower case (as long as the attributes are quoted). All invalid opening and closing tags will have been encoded. All ampersands should be properly encoded. And for good measure, we've replaced all bold (<b>) and italic (<i>) tags with the proper <strong> and <em> tags.

If you put it all together, you get the following code to be included at the end of each script:

<?php

$output = ob_get_contents();
$output = trim($output);
ob_end_clean();
    
function process_attributes($text) {
    return preg_replace("/ ([a-z]+)=\"([^( |\")]*)\"/mie", "' ' . strtolower('$1') . '=\"' . stripslashes('$2') . '\"'", $text);
}

function encode_chars($text) {
    $text = str_replace("<", "&lt;", $text);

    $tag_list = '((\/?)(!DOCTYPE|!--|a(bbr|cronym|ddress|pplet|rea)?|b(ase(font)?|do|ig|lockquote|ody|r|utton)?|c(aption|enter|ite|(o(de|l(group)?)))|d(d|el|fn|i(r|v)|l|t)|em|f(ieldset|o(nt|rm)|rame(set)?)|h(1|2|3|4|5|6|ead|r|tml)|i(frame|mg|n(put|s)|sindex)?|kbd|l(abel|egend|i(nk)?)|m(ap|e(nu|ta))|no(frames|script)|o(bject|l|pt(group|ion))|p(aram|re)?|q|s(amp|cript|elect|mall|pan|t(r(ike|ong)|yle)|u(b|p))|t(able|body|d|extarea|foot|h|itle|r|t)|u(l)?|var)([^>]*))';

    $text = preg_replace("/(&lt;)" . $tag_list . "(>)/mi", "<$2>", $text);
    $text = preg_replace("/(>[^<]*)>/mi", "$1&gt;", $text);
    $text = str_replace("/>", ">", $text);

    return $text;
}

if ((strpos($output, "<!DOCTYPE") > strpos($output, "<html")) or (strpos($output, "<!DOCTYPE") === false)) {
    $output = str_replace('<html', "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\" \"http://www.w3.org/TR/html4/loose.dtd\">\n<html", $output);
}

$output = encode_chars($output);
$output = preg_replace("/<img([^>]*)alt=([^>]*)>/im", "<img$1`alt=$2>", $output);
$output = preg_replace("/<img([^`|>]*)>/im", "<img alt=\" \"$1>", $output);
$output = preg_replace("/<img([^>]*)`alt=([^>]*)>/im", "<img$1alt=$2>", $output);

if (strpos($output, "//W3C//DTD XHTML") !== false) {
    $output = preg_replace("/<(img|hr|meta|link|br|base|frame|input)([^>]*)>/mi", "<$1$2 />", $output);
    $output = preg_replace("/<(\/?)([a-z]+)( |>)/mie", "'<$1' . strtolower('$2') . '$3'", $output);
    $output = preg_replace("/<([^>]+)>/mie", "'<'.process_attributes(stripslashes('$1')).'>'", $output);
    $output = preg_replace("/&(?!#?[xX]?(?:[0-9a-fA-F]+|\w{1,8});)/i", "&amp;", $output);
}

$output = str_replace("<b>", "<strong>", $output);
$output = str_replace("<i>", "<em>", $output);
$output = str_replace("</b>", "</strong>", $output);
$output = str_replace("</i>", "</em>", $output);
echo $output;

?>
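
As an aside, ob_start() also accepts a callback, which PHP runs on the buffered page when the buffer is flushed or the script ends. That could remove the need for a separate script at the end of each page. The sketch below assumes a made-up fix_markup() function that stands in for the full set of fixes above and only performs the <b> and <i> replacements.

```php
<?php
// Hypothetical filter: a stand-in for the complete script above.
function fix_markup($output) {
    $output = trim($output);
    $output = str_replace(array('<b>', '</b>'), array('<strong>', '</strong>'), $output);
    $output = str_replace(array('<i>', '</i>'), array('<em>', '</em>'), $output);
    return $output;
}

// PHP passes the buffered page to fix_markup() when the buffer is flushed
// or the script ends, and sends the return value to the user instead.
ob_start('fix_markup');

echo "<p><b>Bold</b> and <i>italic</i></p>\n";
```

With this arrangement only the prepended file is needed, since the processing is attached to the buffer itself.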
