Chapter 8: Processing Input

Validating the data

Now we've got some required data items and possibly some optional items. But are they valid, reasonable and safe? Maybe they are and maybe they aren't. Let's see what we can do about that.

Since I'll be using the ereg() and eregi() functions a lot, I'd better explain them now. They test whether a string contains anything that matches a pattern.

ereg("pattern", "string") returns 1 (TRUE) if "pattern" occurs in "string". The pattern might be something like "abc" and if those letters occur in that order anywhere in "string", ereg() will return a 1. If not, it returns a 0 (FALSE). There's an optional third argument, an array that ereg() fills with the text that matched the whole pattern and each parenthesized part of it, but you very seldom need that.

eregi("pattern", "string") works the same way with one difference: it is not case-sensitive, while ereg() is. So eregi() will see a match between "dog" and "Dogcatcher" but ereg() won't.
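Here is a small sketch of that difference. One loud assumption: ereg() and eregi() were removed in PHP 7, so this example uses preg_match() as a stand-in, with the "i" flag playing the role of eregi(). The helper names contains_cs() and contains_ci() are made up for illustration and are not part of PHP.

```php
<?php
// Hypothetical stand-in for ereg(): case-sensitive search.
function contains_cs($pattern, $string)
{
    // "#" is the delimiter here, so $pattern must not contain "#" itself.
    return preg_match('#' . $pattern . '#', $string) === 1;
}

// Hypothetical stand-in for eregi(): the "i" flag ignores case.
function contains_ci($pattern, $string)
{
    return preg_match('#' . $pattern . '#i', $string) === 1;
}

var_dump(contains_cs('dog', 'Dogcatcher')); // bool(false)
var_dump(contains_ci('dog', 'Dogcatcher')); // bool(true)
var_dump(contains_cs('dog', 'hotdog'));     // bool(true)
```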

Since some characters have special meanings in regular expressions, they have to be protected or "escaped" when we want them treated as ordinary characters to match against. We do this by putting a \ in front of the character. The special characters are . * + ? ^ $ | \ and the brackets ( ) [ ] { }. Characters such as &, @, #, ! and / mean nothing special to ereg() and need no escaping. Spaces are not ignored either: a space in a pattern matches exactly one space in the string. Inside square brackets most of the special characters lose their special meaning; only ], ^ and - need care there.
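To see escaping at work, here is a minimal sketch. It assumes preg_match() in place of ereg(), since ereg() no longer exists in PHP 7 and later. The variable names are only for illustration. preg_quote() is a real PHP function that adds the backslashes for you.

```php
<?php
// Unescaped, "+" means "one or more of the previous character",
// so this pattern matches "11" or "111" but not the text "1+1".
$unescaped = preg_match('/^1+1$/', '1+1');

// Escaped, "\+" is just a plus sign to match.
$escaped = preg_match('/^1\+1$/', '1+1');

// preg_quote() escapes every special character in a literal string.
// The second argument tells it to escape the "/" delimiter as well.
$literal = preg_quote('$5.00', '/');
$quoted  = preg_match('/^' . $literal . '$/', '$5.00');

var_dump($unescaped); // int(0)
var_dump($escaped);   // int(1)
var_dump($quoted);    // int(1)
```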

Zip Code: Let's assume that we're only interested in U.S. zip codes. They always have 5 digits. Sometimes they are followed by a dash and 4 more digits. Here's our test:

if ($zip != "")
{
  if(!ereg("^([0-9]{5})(-[0-9]{4})?$", $zip)) { $fail .= "$zip is an invalid zip code.<br>\n"; }
}
else{ $fail .= "Please enter your zip code<br>\n"; }

Here's how it works. First the if-statement checks to see if anything was entered for the zip code. If not, the code skips down to the else-clause and sets the error message to say the zip code wasn't entered at all. Otherwise, it checks to see if the entry looks like a real zip code. The ereg() function compares what's in $zip against the pattern for a zip code. Pay close attention now. This gets a little tricky.

The first odd thing you might notice is: !ereg("^([0-9]{5})(-[0-9]{4})?$", $zip). The ! means NOT. You see, ereg() returns a value of TRUE (or 1) when it finds a match and FALSE (or 0) when it doesn't find a match. So what the code really says is this:

If it's TRUE that ereg() did not find a match for ^([0-9]{5})(-[0-9]{4})?$ in $zip ... THEN save a failure message we can display to the user.

Now let's examine ^([0-9]{5})(-[0-9]{4})?$ piece by piece. The ^ and $ are called "pattern anchors." The ^ anchor insists that the pattern must match the beginning of the string. In other words, the first character of the string must be part of the pattern we're trying to match. The $ anchor forces the match to occur at the end of the string. When we use both anchors, the entire string must match the pattern defined by our regular expression.
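The effect of the anchors is easy to check with a short sketch. It assumes preg_match() instead of ereg(), which is gone from current PHP, and the two function names are hypothetical.

```php
<?php
// No anchors: five digits anywhere in the string is enough.
function has_five_digits($text)
{
    return preg_match('/[0-9]{5}/', $text) === 1;
}

// Both anchors: the whole string must be exactly five digits.
// The D flag stops $ from also matching just before a trailing newline.
function is_five_digits($text)
{
    return preg_match('/^[0-9]{5}$/D', $text) === 1;
}

var_dump(has_five_digits('abc12345xyz')); // bool(true)
var_dump(is_five_digits('abc12345xyz'));  // bool(false)
var_dump(is_five_digits('12345'));        // bool(true)
```

Without the anchors, junk typed before or after a valid-looking value would slip through, which is why validation patterns almost always use both.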

The regular expression is: ([0-9]{5})(-[0-9]{4})?. It's actually two separate patterns, each surrounded by parentheses. Parentheses group part of a regular expression so that it can be treated as a single unit, which will matter when we get to the ? character. The first pattern is: [0-9]{5}. Here the square brackets ( [ ] ) hold a set of characters, and any one character from that set counts as a match. Sometimes we list the specific characters we'll accept inside the brackets; other times we use a "range" of characters.

Fine, you say...but what's a range? It can be 0-9 like in our first pattern, which means "any digit from 0 to 9". Or it could be A-Z which means "any upper-case letter" or a-z which means "any lower-case letter". We can also combine two or more ranges just by putting them together inside the square brackets.
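As a quick illustration of combined ranges, the sketch below tests a single character against [A-Za-z0-9]. It assumes preg_match() rather than ereg(), because ereg() was dropped from PHP 7, and is_letter_or_digit() is a made-up name.

```php
<?php
// [A-Za-z0-9] puts three ranges in one set:
// upper-case letters, lower-case letters and digits.
function is_letter_or_digit($char)
{
    // ^ and $ (with the D flag) make sure $char is exactly one character.
    return preg_match('/^[A-Za-z0-9]$/D', $char) === 1;
}

var_dump(is_letter_or_digit('T')); // bool(true)
var_dump(is_letter_or_digit('b')); // bool(true)
var_dump(is_letter_or_digit('7')); // bool(true)
var_dump(is_letter_or_digit(' ')); // bool(false): a space is in none of the ranges
```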

So far, we're saying that we're only interested in matching digits with this pattern - but how many? That's where the { } comes into the picture. We can put a number inside the curly braces to say exactly how many characters (that are in the range we've defined) make up a match. If we're not sure how many characters we should match, we have a few other ways to say "how many?".

We can use * after the range and that will match zero or more of the characters in the range. We can use + after the range and that will match one or more of the characters in the range. We can even write {2,7}, for example, to mean "at least 2 but up to 7" characters from the range will be considered a match.
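The three ways of saying "how many?" can be compared side by side. This is only a sketch: it assumes preg_match() in place of ereg() (removed in PHP 7), digit_checks() is a hypothetical helper, and the interval is written with a comma, as in {2,7}.

```php
<?php
// Hypothetical helper: reports which quantifiers accept $text.
// Each entry is 1 for a match and 0 for no match.
function digit_checks($text)
{
    return array(
        'star'  => preg_match('/^[0-9]*$/D', $text),     // zero or more digits
        'plus'  => preg_match('/^[0-9]+$/D', $text),     // one or more digits
        'range' => preg_match('/^[0-9]{2,7}$/D', $text), // 2 to 7 digits
    );
}

print_r(digit_checks(''));         // star 1, plus 0, range 0
print_r(digit_checks('12345'));    // star 1, plus 1, range 1
print_r(digit_checks('12345678')); // star 1, plus 1, range 0
```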

We know that a basic US Zip Code has 5 digits so we put a 5 inside curly braces like this {5}. Now we're saying we want to match exactly 5 digits, where each one is in the range of 0-9. OK, that's the first pattern. Let's pick the second one apart.

To allow for the possibility of full 9-digit Zip Codes, we have to let the regular expression include exactly 4 digits preceded by a dash. So we use -[0-9]{4} to match that part of a valid US Zip Code. Do you see the dash - outside the square brackets? It means that to match this pattern, there must be a dash followed by 4 digits.

We can't know in advance whether or not the user will type in the optional dash and 4 digits - and that's where the ? pattern-matching character comes in handy. It matches "either zero or one occurrences" of a character (or a pattern). We know that there will be either zero or one copy of the "dash and 4 digits" part - either it's there or it isn't.
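Putting the pieces together, the whole zip code test might be wrapped up as below. Treat it as a sketch under a few assumptions: check_zip() is a hypothetical function name, preg_match() replaces ereg() because ereg() is not available in PHP 7 and later, and htmlspecialchars() is added so the user's input can be echoed back safely.

```php
<?php
// Returns an error message, or an empty string when $zip is acceptable.
function check_zip($zip)
{
    if ($zip === "") {
        return "Please enter your zip code<br>\n";
    }

    // Five digits, optionally followed by a dash and four more digits.
    // The D flag keeps $ from matching before a trailing newline.
    if (preg_match('/^([0-9]{5})(-[0-9]{4})?$/D', $zip) !== 1) {
        return htmlspecialchars($zip) . " is an invalid zip code.<br>\n";
    }

    return "";
}

echo check_zip('12345');      // prints nothing: valid
echo check_zip('12345-6789'); // prints nothing: valid
echo check_zip('1234');       // prints: 1234 is an invalid zip code.<br>
```

In the page's own style, the result would be added to the running error text with $fail .= check_zip($zip);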

What about Canadian and other "postal codes" that include letters or have a different number of characters? Not such a big deal really. Just write a pattern to match them and use that test when $country is not "USA" (or whatever value you're using for the USA.)

A typical Canadian postal code looks like this: T2B 1P7 or T2B1P7. Sometimes the letters are typed in lower-case but you can ignore that to make it simpler. Here's a pattern that will match them:

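^[A-Z0-9 ]+$       NOTE: the blank before the ] is a literal space. ereg() does not understand the \s shorthand; the class [:space:] can be used inside the brackets instead.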

This pattern matches "one or more upper-case letters, digits or spaces" in any order. Zero characters is not a match, because we put the + sign after the square brackets. Each character in $zip will be checked to see if it's an upper-case letter, a digit or a space. If any character in $zip isn't one of those - we don't have a match. Keep in mind that this only checks which characters are used, not their order, so it will also accept an entry like "999" that isn't a real postal code.
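For comparison, here is the loose pattern next to a stricter one that checks the letter-digit-letter, digit-letter-digit shape. Both are sketches: the function names are hypothetical, preg_match() is assumed in place of ereg() (which modern PHP no longer has), the space is written as a literal space inside the brackets, and the stricter pattern checks only the shape, not whether the code actually exists.

```php
<?php
// The loose test: upper-case letters, digits and spaces in any order.
function is_loose_postal($code)
{
    return preg_match('/^[A-Z0-9 ]+$/D', $code) === 1;
}

// A stricter, hypothetical alternative: letter, digit, letter,
// an optional space, then digit, letter, digit.
function is_canadian_shape($code)
{
    // strtoupper() lets lower-case entries such as "t2b 1p7" pass.
    return preg_match('/^[A-Z][0-9][A-Z] ?[0-9][A-Z][0-9]$/D', strtoupper($code)) === 1;
}

var_dump(is_loose_postal('T2B 1P7'));   // bool(true)
var_dump(is_loose_postal('999'));       // bool(true): loose test accepts it
var_dump(is_canadian_shape('T2B1P7'));  // bool(true)
var_dump(is_canadian_shape('999'));     // bool(false)
```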

Copyright © 2004 Steve Humphrey