Chapter 8: Processing Input

Google

Tracking traffic with cookies and a database

There is one thing I'd better explain right here. This script requires that you have a database available. It also requires that the database have a table named "hits". Chapter 14 will cover setting up your database. If you want to use this script right away, skip to Chapter 14. Get your database set up and write down the database name, username and password. You'll use them to edit some values in this script so that it will work.

You'll also need to create the "hits" table. I'm not going to insist that you name the table "hits" - although it seems like a perfectly good name to me. You can change the table name to something else if you prefer. Just be sure you edit the line in the script that starts out like this:

$sql = "INSERT INTO hits(ip,date,browser...

Make sure that where the original line said "hits" you put in the real name of your table.

I'm going to make several assumptions - although they may not hold true for everyone. First, I'll assume that your web site is located on a host that supports MySQL. If it isn't...then I strongly suggest you move to a new host. Not having access to a database is like having one hand tied behind you and one foot nailed to the floor.

Based on my first assumption, I'll assume that your host provides a web-based control panel such as cPanel, Exim or Plesk. In general, these control panels have a link or an icon that takes you to a control panel for your database(s).

In cPanel, for example, you click on the MySQL databases icon or link. Then you see a page where you can see, add or drop databases and database users. Down at the bottom of that page is a link labeled phpMyAdmin.

As you may have guessed, phpMyAdmin is a PHP script that gives you total control over your databases, their tables and the data inside the tables. As soon as you have created a database, you can go into phpMyAdmin and make the tables you want.

Other control panels and other web hosting setups may use a different script for managing databases. Whichever one you have, be sure to read the Help files or other documentation.

Now here's the command that will create the table for you:

CREATE TABLE `hits` (
`ip` varchar(16) default NULL,
`date` datetime default NULL,
`browser` varchar(16) default NULL,
`os` varchar(16) default NULL,
`engine` varchar(100) default NULL,
`lang` varchar(16) default 'Other',
`keywords` varchar(255) default NULL,
`page` varchar(100) default NULL,
`newuser` varchar(16) default 'No',
`last_visit` datetime default NULL,
`days_ago` tinyint(4) default NULL,
KEY `index1` (`date`)
) TYPE=MyISAM;

I know you're anxious to see how this is done, so I'll get right to it. The first thing you need to do is look for a cookie set by the last visit. That might sound odd, since there may not be a cookie yet. But if there is a cookie - you need to read it before you update it. If no cookie is found, then you'll set initial values and write one. If there is a cookie, you'll need to know its current values before you set new ones. Let's start like this:

<?php
//Determine if user is new to the website
$visit1 = $_COOKIE["visit1"];
$visit2 = $_COOKIE["visit2"];
//Determine if user is new to the website
$newuser=($visit1=="")?"Yes":"No";

We're dealing with two cookies here. The cookie containing the value visit1 tells us if there has been a previous visit and the date it occurred. The other cookie will store the time of the current visit. The visit1 value is stored in the normal date format of Year-Month-Date Hour:Minute:Second. The visit2 value is stored in the system time format (the number of seconds since Jan 1, 1970).

Look closely at the last line in the code fragment above. Do you see something that looks strange? Hint: it's the ?. As its being used here, it's an operator. Here's how it works...

The condition in parentheses is evaluated to determine if it's true or false. After the ?, there are two options separated by a colon. If the condition is true, the first option is chosen. If it's false, the second option is taken. Whichever option is selected is then assigned to the variable to the left of the = sign.

So what the line really says is, "If there's no data in visit1, then (Yes) this is a new user. If there is data in visit1 then (No) this is not a new user." The value of $newuser will be stored in the database when this page view is recorded. Now let's look at some more code.

// Determine date and time
$date = date("Y-m-d h:i:s");
//Get date and time of last visit
$visit=($visit1)?$visit1:$date;

I've used the date() function to get the current time and date. $visit will become either the date stored in $visit one (it's a returning visitor) or the current date (it's a new visitor). This is another example of using the ? operator to make a decision and take action based on that decision. A statement like this is actually a very compressed version if an if-then-else statement. Let's continue.

//Get number of days since last visit
$days_ago=($visit2!="")?intval( (time()-intval($visit2)) / (60*60*24) ):0;
//If result is a decimal number, then set it to 0
if($days_ago<1) $days_ago=0;

Here, the value of visit2 is tested. If there is no value, then the number of days since the last visit is 0 because this is the first visit by this user. Otherwise, I have to do some math. intval($visit2) guarantees that visit2 will be a whole number (an integer). time()-intval($visit2) is the number of seconds since the last visit. Dividing that by (60*60*24) - the number of seconds in a day - gives me the number of days since the last visit. intval() is used one last time to make sure that the number of days is an integer.

If the answer I got was less than 1 whole day...I conclude that it has been zero days since the last visit. The final value of $days_ago will be stored in the database. The values of $date and $visit will also be stored. These are the only values that required the use of cookies. Everything else can be worked out from data we already have.

In fact, the other data items could be collected without using cookies. So if the user has turned off all cookies in the browser, we'll still know quite a bit about the visit. Here's the rest of the code for setting the cookies(yes, there are two of them).

//Set cookie expire time in (90 days * 24 hours * 3600 seconds per hour)
$cookie_expire=time()+(90*24*3600);

You have to set the expire time of the cookie in seconds. I chose to make my cookie expire in 90 days. If you want a different expire time, change the 90 to a different number of days. You can play around with the expire time calculation if you like; it's just simple arithmetic.

//Set last visit date and time in one cookie
setcookie("visit1", $date, $cookie_expire);

Here we're sending visit1 as the name of the cookie, $date as the value and $cookie_expire as the expire time. Simple enough? The domain will be assumed to be the top-level domain of your site (e.g., www.mysite.com) Hint: if you want the cookie to be good only for a certain page, be sure to set and send the domain and path variables.

//Set current timestamp in the other cookie
setcookie("visit2", time(), $cookie_expire);

Here we're sending visit2 as the name of the cookie, the current system time in seconds as the value and $cookie_expire as the expire time. The same rules that apply to the first cookie apply to this cookie.

// Find IP address
$ip = $REMOTE_ADDR

It's nearly impossible for the user to hide the IP address, since the browser sends it along with the page request. However, if the user is sending through a "proxy server" (for surfing anonymously) we'll get the IP address of the proxy server instead of the "real" IP address. So this address may or may not be accurate...

// Find browser and operating system
$agent=split("; ", getenv("HTTP_USER_AGENT"));
$browser=$agent[1];
$os=ereg_replace("\)", "", $agent[2]);

The environment variable HTTP_USER_AGENT contains the browser name and version as well as the user's operating system(OS - Windows 2000, etc.). Since these items will be separated by a semi-colon, I use that to split them apart and put them in a list. I don't need the first item in the list...so I ignore it. The second item is the browser and the 3rd item is the OS. There's a ) right after the OS, so I use the ereg_replace() function to replace it with "nothing" - thus getting rid of it.

// Get referring page
$url = $HTTP_REFERER;
if($url == "") {$url="Not Available";}

If the visitor followed a link from another site, HTTP_REFERER will tell us what that link looked like. If not, I just mark it as "Not Available". The reason for doing that is so that when I'm rummaging through the database I can see how many hits came from people who typed the URL into their browser - these are what I like to call "direct hits".

// determine which page was viewed
$uri = $REQUEST_URI;
if($uri == "/") {$uri = "home page";}

$REQUEST_URI is another environment variable. It contains everything after the first / in the URL (after the http:// part, of course). What does that mean exactly? Well, for the home page, there's nothing after the "www.mysite.com" except for a /. So that trailing / is all you get in $REQUEST_URI - unless the hit you just tracked had followed a link from one page back to the home page.

Did that make sense? Look at it this way...suppose you're on the www.mysite.com/links page. If you click a link marked "Home" or whatever that takes you back to the home page, you'll see the path (in the address bar of your browser) as:

http://www.mysite.com/index.html or http://www.mysite.com/index.php

Instead of just:

http://www.mysite.com/

This time $REQUEST_URI contains "/index.html" or "/index.php" instead of just "/".

I chose to have the code change "/" to "home page" - just to make it more intuitive when I search the database. You don't have to do it if you don't want to. Hey, it's your site, your system and your data. Do whatever works best for you.

For any other page, $uri will end up holding something like:

/path/to/some/page.php ... and that's what will be recorded in your database.

That's a key point! When you search the database, you can look for hits on a specific page. You can filter them by a date, a range of dates, one or more keywords or keyword phrases, etc. Bottom line: you can search based on any kind of data you've stored.

$languages=Array("en-us"=>"English-US","en-gb"=>"English-GB","en-au"=>"English-AUS",
"en-ca"=>"English-CA","en"=>"English","de"=>"German", "zh"=>"Chinese", "es"=>"Spanish",
"fr"=>"French","ja"=>"Japanese", "it"=>"Italian");

// Now get the user's language(s)
$language=getenv("HTTP_ACCEPT_LANGUAGE");
$lang=ereg_replace("\;\.\*", "", $language);
$lang=$languages[$lang];
if($lang == "") {$lang = "Other";}

When the user first set up a browser, a value was set that indicates that user's preferred language(s). This value is always sent with a page request in the environment variable HTTP_ACCEPT_LANGUAGE. I've made an array that lists some of the more common languages and indexed it by the abbreviations you find in HTTP_ACCEPT_LANGUAGE.

I'm discarding the characters in HTTP_ACCEPT_LANGUAGE that occur before the language abbreviation. I'm also choosing to ignore the fact that there can be more than one language value. The first one I find will always be the user's most preferred language - and that's good enough for me.

Using the abbreviation as an index I can look up the full name of the language from the array. If there's no matching entry in the array, I just substitute "Other" for the name of the language. Obviously the more complete language entries you have in your array, the less often you'll fail to get a match. If you really expect a lot of visitors whose language isn't listed, find out what the abbreviation is - and add another entry to the array.

$se=Array("yahoo", "google", "altavista", "alltheweb", "msn");
// determine which search engine was used
foreach($se as $k => $v) {
  if (ereg($v, $url)) {$engine = $v; break;}
  else {$engine = "None or Unknown";}
}

Remember that we already got the referring page (a link that the visitor followed)? Good... I've made a simple array of the smallest strings necessary to work out which search engine sent the visitor here. If the link contains "google", the visitor came from www.google.com, etc.

What I'm doing now is using a foreach() loop to look at each entry in the array - and it doesn't matter exactly how many entries there are. A foreach() loop will simply stop after it reads the last item in the array.

For each entry in the array, the loop grabs the index (or "key", represented by $k) and its corresponding value (represented by $v). The ereg() function is used to see if $v occurs anywhere in the current value of $url. If it does, $engine is set to $v and the break statement stops the loop.

Once we have a match, there's no point wasting time looking for something we know we won't find. That's why I stop the loop with break; If we get no match at all, we just save something like "None or Unknown". It might be that the search engine wasn't in our list - or it might mean the visitor didn't come from a search engine at all.

$vars = split("&", $url);
foreach($vars as $k => $v) {
  if(ereg("\+", $v)) {
    $keywords = ereg_replace("\+", " ", ereg_replace(".*q\=", "", $v));
    break;
 }
}
$keywords = ereg_replace("\%\2\2", "", $keywords);

Search engines create links to web pages. Those links have a "query string". You'll learn about them in depth in Chapter 9 but here's a quick overview. A query string starts at the end of a path (a URL that leads to a certain page or script) and begins with a ?. What follows the ? is one or more key/value pairs, separated by &. They look like assignment statements: email=me@here.org&id=23456, etc.

So the first task is to split up this big long URL into sections using the & as a split point. Then put them into a list or array. The split() function does this job in one line, putting the results into the array $vars.

In these query stings, the part where the keywords are found is just one of the key/value pairs. What's different is that the keywords are separated by + signs. Why? Well, you wouldn't want the keywords all run together - and spaces aren't allowed in a URL. I don't know why the + sign was chosen as the separator because (unlike Al Gore) I didn't invent the Internet! <grin> The point is simply that that's how it's done.

So as we loop through the broken pieces of this URL, we're looking for a + sign. When we find a piece that has one (or more), we've located the keywords. We use ereg_replace to put spaces where the + signs were and now we have a readable list of keywords. It might be more correct to call it a keyword phrase since the search was based on all the words.

By looking at this piece of the query string closely, I found that the "key" was nearly always "q". So I remove everything up to and including "q=" just before I replace the + signs with spaces. If you discover a search engine that uses a different key, you can make a list of possible keys indexed by the search engines that use them and substitute for "q=" to keep the code happy.

The last line was added to remove any % signs. I noticed that Alta Vista, in particular, kept leaving % signs in the keyword list. A % sign has a special meaning in MySQL databases, so I didn't want to store it and get the database all confused!

In case you're confused about the way I removed them...it's because % is sent as %22. That's a numeric representation of the actual % character. The % sign you see indicates that it's a decimal number standing for an ASCII character. If it all sounds pretty twisted...it's because % itself has this special meaning when used in a URL.

"Act normal - there's crazy enough already." - Dutch proverb

At last! Now we have everything we want to write into the "hits" table in the database - to log the info about this visit. Here's the code for doing that:

// save the data in the database
$db_name = "MyDatabase";
$db_host = "localhost";
$db_user = "MyUserName";
$db_pass = "MyPassword";
$db_conn = @mysql_connect ($db_host, $db_user, $db_pass);
$db = @mysql_select_db($db_name, $db_conn);
$sql = "INSERT INTO hits(ip,date,browser,os,engine,lang,keywords,page,newuser,last_visit,days_ago)
VALUES(\"$ip\",\"$date\",\"$browser\",\"$os\",\"$engine\",\"$lang\",\"$keywords\",\"$uri\",
\"$newuser\",\"$visit\",\"$days_ago\")";
$result = @mysql_query($sql, $db_conn);
?>

Before you use this code, be sure to edit the script and put in real values for $db_name, $db_user and $db_pass. You might have to put in something specific for $db_host, too. But unless your database is located on a different server from your web host - or your host tells you to use a specific value for $db_host, "localhost" should work.

When you get to Chapter 14, I'll teach you (hopefully) as much as you'll need to know about using databases. If you want to see the "nuts and bolts" of how the last section of this script works, please flip to Chapter 14 and it should be easy to follow.

Previous Page   Table of Contents   Next Page

Copyright © 2004 Steve Humphrey