Tagged with "regex" - Added Bytes (Web Development in Brighton)
http://www.addedbytes.com/feeds/tag-feed/

Email Address Validation
http://www.addedbytes.com/blog/code/email-address-validation/

PLEASE NOTE: This function is now considered out of date. An updated version incorporating many of the comments below has been released under an open source license as a Google Code project: php-email-address-validation. There is more about this change in the post Email Address Validation Updated.

Many email address validators will actually throw up errors when faced with a valid, but unusual, email address. Many, for example, assume that an email address with a domain name extension of more than three letters is invalid. However, new TLDs such as ".info", ".name" and ".aero" are perfectly valid but longer than three characters. Many email address validators fail to take into account that you do not necessarily need a domain name in an email address - an IP address is fine.

The first step to creating a PHP script for validating email addresses is to work out exactly what is and is not valid. RFC 2822, which specifies what is and is not allowed in an email address, states that an email address must take the form "local-part @ domain".

The "local-part" of an email address must be between 1 and 64 characters in length and may be made up in any one of three ways. It can be made up of characters (and only these characters) from the following selection, with letters in upper or lower case, though a period cannot be the first character:

  • A to Z
  • 0 to 9
  • !
  • #
  • $
  • %
  • &
  • '
  • *
  • +
  • -
  • /
  • =
  • ?
  • ^
  • _
  • `
  • {
  • |
  • }
  • ~
  • .

Or, it can be made up of a quoted string containing any characters except "\". Older email addresses may be made up differently, and may contain a combination of the above. The following are all valid as the first part of an email address:

  • dave
  • +1~1+
  • {_dave_}
  • ""
  • dave."dave" (Note that this is considered an obsolete form of address - new addresses created should not be of this form, but it is still considered valid.)

The following, though similar, are all invalid:

  • -- dave -- (spaces are invalid unless enclosed in quotation marks)
  • [dave] (square brackets are invalid, unless contained within quotation marks)
  • .dave (the local part of an email address cannot start with a period)

The "domain" portion of the email address can also be made up in different ways. The most common form is a domain name, which is made up of a number of "labels", each separated by a period and between 1 and 63 characters in length. Labels may contain letters, digits and hyphens, but must not begin or end with a hyphen. Officially, a label must begin with a letter, not a digit, but many domain names beginning with digits have been registered, so for the purposes of validation we will allow digits at the start of labels. A domain name, technically, need be only one label. In practice, however, domain names are made up of at least two labels, so for the purposes of validation we will check for two. A domain name may not be over 255 characters in total. The domain portion of an email address may also be an IP address, which can in turn be enclosed in square brackets.

In order to check that email addresses conform to these guidelines, we'll need to use regular expressions. First, we need to match the three possible forms of the local part of an email address, using the two patterns below (we'll add in escape characters later, when we put the function together):

^[A-Za-z0-9!#$%&'*+/=?^_`{|}~-][A-Za-z0-9!#$%&'*+/=?^_`{|}~\.-]{0,63}$
^"[^(\|")]{0,62}"$

We can use the two patterns we've defined here to check for obsolete local parts of email addresses too, saving ourselves from needing a third pattern.
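To make the local-part rules a little more concrete, here is a minimal sketch of the same idea in Python rather than PHP. The names LOCAL_ATOM, LOCAL_QUOTED and is_valid_local_part are made up for this illustration. It splits on periods, in the same way as the function later in this article, so it is an approximation of the rules above and not a full RFC 2822 parser (a quoted string containing a period, for instance, would be rejected).

```python
import re

# Unquoted piece: one or more allowed characters (hyphen placed last so it is literal).
LOCAL_ATOM = re.compile(r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]+")
# Quoted piece: anything except a backslash or a double quote, wrapped in quotes.
LOCAL_QUOTED = re.compile(r'"[^\\"]*"')


def is_valid_local_part(local: str) -> bool:
    """Return True if `local` looks like a valid local part (sketch only)."""
    if not 1 <= len(local) <= 64:
        return False
    # Splitting on periods rejects a leading period (the first piece is empty)
    # and lets obsolete mixed forms such as dave."dave" through.
    for piece in local.split("."):
        if not (LOCAL_ATOM.fullmatch(piece) or LOCAL_QUOTED.fullmatch(piece)):
            return False
    return True
```

Run against the examples listed earlier, this accepts dave, +1~1+, {_dave_}, "" and dave."dave", and rejects -- dave --, [dave] and .dave.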

Next, we need to check the domain portion of the email address. It can either be an IP address or a domain name, so we can use the two patterns here to validate it:

^\[?[0-9\.]+\]?$
^[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?(\.[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?)+$

The second pattern will match a valid domain name, and because labels may be made up entirely of digits it will also accept an IP address written without brackets. The first pattern is still needed for an IP address enclosed in square brackets.
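The domain rules can be sketched in the same way. Again this is Python rather than PHP, and LABEL, IP_LIKE and is_valid_domain are names invented for the example. The IP check is deliberately as loose as the pattern in this article (digits and periods, optionally in square brackets), so it does not verify that each number is in range.

```python
import re

# One label: 1 to 63 characters, letters/digits/hyphens, no hyphen at either end.
LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")
# Loose IP check: digits and periods, optionally wrapped in square brackets.
IP_LIKE = re.compile(r"\[?[0-9.]+\]?")


def is_valid_domain(domain: str) -> bool:
    """Return True if `domain` looks like a valid domain part (sketch only)."""
    if not 1 <= len(domain) <= 255:
        return False
    if IP_LIKE.fullmatch(domain):
        return True
    labels = domain.split(".")
    if len(labels) < 2:
        return False  # we insist on at least two labels
    return all(LABEL.fullmatch(label) for label in labels)
```

Checking the labels one at a time keeps the length limit (63 characters per label) easy to express, which is awkward to do in a single pattern for the whole domain.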

Putting it all together gives us the following function. Call it like any normal function, and you will get back a value of "true" if the string entered is a valid email address, or "false" if the input was an invalid email address.

function check_email_address($email) {
    // First, we check that there's one @ symbol, and that the lengths are right
    if (!ereg("^[^@]{1,64}@[^@]{1,255}$", $email)) {
        // Email invalid because wrong number of characters in one section, or wrong number of @ symbols.
        return false;
    }
    // Split it into sections to make life easier
    $email_array = explode("@", $email);
    $local_array = explode(".", $email_array[0]);
    for ($i = 0; $i < sizeof($local_array); $i++) {
        if (!ereg("^(([A-Za-z0-9!#$%&'*+/=?^_`{|}~-][A-Za-z0-9!#$%&'*+/=?^_`{|}~\.-]{0,63})|(\"[^(\\|\")]{0,62}\"))$", $local_array[$i])) {
            return false;
        }
    }
    if (!ereg("^\[?[0-9\.]+\]?$", $email_array[1])) { // Check if domain is IP. If not, it should be valid domain name
        $domain_array = explode(".", $email_array[1]);
        if (sizeof($domain_array) < 2) {
            return false; // Not enough parts to domain
        }
        for ($i = 0; $i < sizeof($domain_array); $i++) {
            if (!ereg("^(([A-Za-z0-9][A-Za-z0-9-]{0,61}[A-Za-z0-9])|([A-Za-z0-9]+))$", $domain_array[$i])) {
                return false;
            }
        }
    }
    return true;
}

Using the function above is relatively simple, as you can see:

if (check_email_address($email)) {
    echo $email . ' is a valid email address.';
} else {
    echo $email . ' is not a valid email address.';
}

You can now validate email addresses entered into your site against the specifications that define email addresses (more or less - domain names that start with a number are supposed to be invalid, but do exist).

Finally, please do remember that just because an email address looks valid does not mean it is in use. A validation script is a good start, but although it can tell you an email address is technically valid, it cannot tell you whether it is in use. You might benefit from checking in more depth, for example seeing if a domain name is registered. Even better, fire off an email to the address given by a user and get them to click a link to confirm it is real - the only way to be 100% sure.



Dave Child, Tue, 01 Jun 2004 13:16:31 +0100

VBScript Regular Expressions
http://www.addedbytes.com/blog/code/vbscript-regular-expressions/

"Regular expressions" are two words that can bring many to their knees, weeping, but in VBScript they are not as scary as some would have you believe. With their roots in Perl, regular expressions in VBScript use similar syntax, and the chances are that you are already familiar with the concepts here if you have played with regular expression matching before.

Below, you will find three sections. The first section, Reference, is a simple reference listing the most-used of the various symbols and characters used in regular expressions. The second section, Functions, has two functions in it that may make life easier for you. The third section, Examples, is where the fun begins - examples of regular expressions in action.

Reference

Character Sets and Grouping

  • . - Any single character (except new line character, "\n")
  • [] - Encloses any set of characters
  • ^ - When placed first inside [], matches any character not within the following set
  • [A-Z] - Any upper case letter between A and Z
  • [a-z] - Any lower case letter between a and z
  • [0-9] - Any digit from 0 to 9
  • () - Group section. Also can then be back-referenced with $1 to $n, where n is the number of groups
  • | - Or. (ab)|(bc) will match "ab" or "bc"

Repetition

  • + - One or more
  • * - Zero or more
  • ? - Zero or one
  • {5} - Five
  • {1,3} - One to three
  • {2,} - Two or more

Positioning

  • ^ - Start of string
  • $ - End of string
  • \b - Word boundary (start or end of a word)
  • \n - New line
  • \r - Carriage return

Miscellaneous

  • \ - Escape character
  • \t - Tab
  • \s - White space
  • \w - Matches any word character (equivalent of [A-Za-z0-9_])
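Most of the symbols in the reference above behave the same way in other Perl-style regular expression engines, so they are easy to experiment with outside ASP. The snippet below is a small sketch using Python's re module (chosen purely for illustration; the sample strings and variable names are invented). One difference to watch for: Python writes back-references in a replacement as \1 rather than $1.

```python
import re

# Character sets and repetition: three upper case letters, then one to three digits.
has_code = re.search(r"[A-Z]{3}[0-9]{1,3}", "ref ABC42") is not None

# ^ and $ anchor the pattern to the start and end of the string.
all_lower = re.search(r"^[a-z]+$", "lowercase") is not None
mixed = re.search(r"^[a-z]+$", "Mixed case") is not None

# A leading ^ inside brackets negates the set: is there any non-digit?
has_non_digit = re.search(r"[^0-9]", "12345") is not None

# \b is a word boundary, \w a word character; the tab counts as white space.
words = re.findall(r"\b\w+\b", "two\twords")

# Groups can be reused in the replacement (Python uses \1 and \2).
swapped = re.sub(r"(\w+)\s(\w+)", r"\2 \1", "hello world")
```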

Please note that the backslash escape character mentioned above only has this meaning inside a regular expression pattern; it is not an escape character in normal VBScript strings. Regular expression syntax is based upon Perl regular expression syntax. To include a double quote in a VBScript string, you double it. For example, the following will print out 'This is a "quoted" piece of text.'

response.write("This is a ""quoted"" piece of text.")

Functions

The first of the functions below, ereg (named after the PHP function to keep me from going quite quite mad), is the one you will probably use most. Simply put, if you feed in a string, pattern, and choose whether or not you would like to ignore the case of letters in either, the function will return TRUE if the string contains the pattern, or FALSE if not.

function ereg(strOriginalString, strPattern, varIgnoreCase)
    ' Function matches pattern, returns true or false
    ' varIgnoreCase must be TRUE (match is case insensitive) or FALSE (match is case sensitive)
    dim objRegExp : set objRegExp = new RegExp
    with objRegExp
        .Pattern = strPattern
        .IgnoreCase = varIgnoreCase
        .Global = True
    end with
    ereg = objRegExp.test(strOriginalString)
    set objRegExp = nothing
end function

Next up we have ereg_replace. Like its shorter cousin, you need to feed it a string and a pattern, and choose your case sensitivity. This time, you must also add a replacement. This function will replace all instances of the pattern with the replacement in the string (if you change ".Global = True" to ".Global = False" then the function will only replace the first instance of the pattern with the replacement).

function ereg_replace(strOriginalString, strPattern, strReplacement, varIgnoreCase)
    ' Function replaces pattern with replacement
    ' varIgnoreCase must be TRUE (match is case insensitive) or FALSE (match is case sensitive)
    dim objRegExp : set objRegExp = new RegExp
    with objRegExp
        .Pattern = strPattern
        .IgnoreCase = varIgnoreCase
        .Global = True
    end with
    ereg_replace = objRegExp.replace(strOriginalString, strReplacement)
    set objRegExp = nothing
end function
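For comparison, here is a rough sketch of the same two helpers in Python. The function names simply mirror the VBScript ones; the replace_all parameter is an addition for this example and stands in for the ".Global" property. Note that Python replacements refer to groups as \1, \2 and so on, not $1, $2.

```python
import re


def ereg(original: str, pattern: str, ignore_case: bool) -> bool:
    """Return True if `pattern` is found anywhere in `original`."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.search(pattern, original, flags) is not None


def ereg_replace(original: str, pattern: str, replacement: str,
                 ignore_case: bool, replace_all: bool = True) -> str:
    """Replace matches of `pattern`; `replace_all` plays the role of .Global."""
    flags = re.IGNORECASE if ignore_case else 0
    count = 0 if replace_all else 1  # count=0 tells re.sub to replace every match
    return re.sub(pattern, replacement, original, count=count, flags=flags)
```

For example, ereg_replace("a-b-c", "-", "+", False) replaces both hyphens, while passing replace_all=False replaces only the first.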

Examples

Example 1: Checking hexadecimal string

A hexadecimal number can be made up of any digit, and any letter, upper or lower case, between a and f, inclusive. So to check if a string is actually hexadecimal, the following will do quite nicely (strOriginalString is the original string to be tested):

<%
if ereg(strOriginalString, "[^a-f0-9\s]", True) = True then
    response.write "String is not hexadecimal."
else
    response.write "String is hexadecimal."
end if
%>

The pattern, "[^a-f0-9\s]" matches anything that is not in the set of characters specified (so if there is anything in the string that is not in that set, the function will return True). The characters specified are all letters between a and f inclusive, and we've specified a case insensitive match, so upper case letters will be treated the same way. We are also allowing whitespace (new lines, spaces, carriage returns and tabs), which is what the "\s" represents in regular expressions.

Example string that returns False (and is therefore hexadecimal):

AAcc99
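The same "look for anything that should not be there" approach can be sketched in Python. NON_HEX and is_hexadecimal are names made up for this example. As with the VBScript version, an empty string counts as hexadecimal here, because it contains nothing outside the allowed set.

```python
import re

# Any character that is not a hex digit or white space; case is ignored.
NON_HEX = re.compile(r"[^a-f0-9\s]", re.IGNORECASE)


def is_hexadecimal(text: str) -> bool:
    """Return True if `text` contains only hex digits and white space."""
    return NON_HEX.search(text) is None
```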

Example 2: Masking the last section of an IP address

An IP address is made up of four sets of numbers separated by periods. It's common practice, if you are going to display visitor (or any) IP addresses on your site, to mask the last (fourth) set of numbers. Here's a way to use ereg_replace to do just this:

<% strOriginalString = ereg_replace(strOriginalString, "([^0-9])([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.[0-9]{1,3}([^0-9])", "$1$2.$3.$4.***$5", True) %>

This is a little more tricky, as you'd hopefully expect from a second example. It looks harder than it is though, so one step at a time. There are actually only a few entities in the pattern - they are just repeated. The most important is this: "([0-9]{1,3})". It matches a section of an IP address, and is enclosed in brackets so that this section can be used in the replacement of the pattern as well (otherwise we would not be able to keep the first three parts of the IP address to display). You can see these sections in use, referenced with "$2", "$3" and "$4" in the replacement. The pattern within the brackets simply says "between one and three digits between 0 and 9". The "([^0-9])" groups at either end each match a single character that is not a digit, so that longer runs of digits such as "4444" are not treated as part of an IP address; they are put back in the replacement with "$1" and "$5".

The second repeated section is "\.". We use a backslash before the period to indicate that this period (the character following the backslash) is to be treated as a normal period. We call this an escaped character, and this is a fairly common practice. The period, unescaped (without the backslash), is used as a symbol representing "any character except the new line character".

Example input text:

My IP address is 123.456.78.9 but 4444.1.1.1 is just a bunch of random numbers, and so is 12.34.56, and 1.1.1.1 is another valid IP.

Example output text:

My IP address is 123.456.78.*** but 4444.1.1.1 is just a bunch of random numbers, and so is 12.34.56, and 1.1.1.*** is another valid IP.
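To try the masking pattern outside ASP, here is a minimal Python sketch using the same pattern and the same example text. IP_PATTERN and mask_ips are names invented for the example, and the replacement uses Python's \1 style of back-reference in place of $1.

```python
import re

# Same pattern as the VBScript example: a non-digit, four dotted groups of
# one to three digits, then another non-digit.
IP_PATTERN = re.compile(
    r"([^0-9])([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.[0-9]{1,3}([^0-9])"
)


def mask_ips(text: str) -> str:
    """Replace the fourth section of anything that looks like an IP with ***."""
    return IP_PATTERN.sub(r"\1\2.\3.\4.***\5", text)


sample = ("My IP address is 123.456.78.9 but 4444.1.1.1 is just a bunch of "
          "random numbers, and so is 12.34.56, and 1.1.1.1 is another valid IP.")
masked = mask_ips(sample)
```

Because the pattern insists on a non-digit character on both sides, an address at the very start or very end of the string is left unmasked. If that matters, padding the string with a space at each end before matching is one way around it.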

Example 3: Making the second word of every sentence in a string bold, as long as the word before only contains upper case letters and the second word does not contain an even digit

Getting more interesting now, this example is not in the least bit useful in practice, but should prove to be a useful demonstration of the power of regular expressions. It sounds tough - but with regular expressions, it's a walk in the park.

<%
strOriginalString = ereg_replace(". " & strOriginalString, "(\.|!|\?)\s([A-Z]+)\s([^02468\s]+)\s", "$1 $2 <strong>$3</strong> ", False)
strOriginalString = mid(strOriginalString, 3)
%>

We start by adding an artificial period and space to the beginning of the string, just to make sure we catch the first sentence, and add a line to strip our extra characters out afterwards. We only want those sentences split with punctuation and a space, or we'll end up with bold decimals and it will be very messy indeed. So, we check for punctuation, followed by a space, followed by a word made entirely of capitals, followed by another space, followed by a second word that doesn't contain even digits or whitespace, followed by a space. If we find that, we replace it with the same items we picked up in brackets, only with a <strong></strong> tag pair around the second word.

Example input text:

THE quick brown fox jumped over the lazy dog? Many red balloons blew up! EVEN num2ber sentence. ODD num3ber sentence.

Example output text:

THE <strong>quick</strong> brown fox jumped over the lazy dog? Many red balloons blew up! EVEN num2ber sentence. ODD <strong>num3ber</strong> sentence.
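Here is the same trick sketched in Python, for anyone who wants to watch the tags appear. PATTERN and bold_second_words are names made up for the example; the pattern itself is unchanged, and the match is case sensitive, as in the VBScript call.

```python
import re

# Punctuation, a space, an all-capitals word, a space, then a word with no
# even digits or white space, followed by a space.
PATTERN = re.compile(r"(\.|!|\?)\s([A-Z]+)\s([^02468\s]+)\s")


def bold_second_words(text: str) -> str:
    """Wrap qualifying second words in <strong> tags (sketch only)."""
    # Prepend ". " so the first sentence is preceded by punctuation too.
    padded = ". " + text
    result = PATTERN.sub(r"\1 \2 <strong>\3</strong> ", padded)
    return result[2:]  # drop the two characters added above


sample = ("THE quick brown fox jumped over the lazy dog? Many red balloons "
          "blew up! EVEN num2ber sentence. ODD num3ber sentence.")
output = bold_second_words(sample)
```

Only "quick" and "num3ber" end up in bold: "Many" is not all capitals, and "num2ber" contains an even digit.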

Dave Child, Fri, 07 Nov 2003 09:29:40 +0000