Inconsistent Web Analytics Numbers: Google vs. The World
Over the past 11 years, iMarc has used a number of web analytics tools. Whether FunnelWeb, Webalizer, Urchin, Mint, or Google Analytics, the goal is always to understand how people use the web and make optimizations based on that usage.
Recently, we've been recommending Google Analytics. Of course, Google Analytics has its limitations and problems, most notably that it requires the end user to have JavaScript enabled and to accept cookies. That said, Google's ease of use, especially when compared to other reporting software, made it our choice for most clients.
Once we started moving clients' sites from Webalizer and Urchin to Google Analytics, we were amazed at the discrepancies in traffic numbers. Google's numbers were much lower: sometimes half, 1/5th, even 1/10th of the traffic that Urchin was reporting. Luckily, though the numbers were inconsistent between software packages, traffic trends were almost identical. Moral of that story: don't change reporting tools.
However, once committed to switching from a log-based analyzer like Urchin to Google Analytics, we were determined to learn more about what was causing this discrepancy.
We looked at one of our websites, picked a random day, and compared the results from Google Analytics to Urchin/Webalizer (Urchin and Webalizer are different programs, but their reporting numbers are almost identical). Since we have access to the raw Apache logfile, we looked at that as well.
Coincidentally, Urchin's results are almost identical to the server's logfile, but Google Analytics is by far the best gauge of true, meaningful traffic. Google Analytics' numbers may be dramatically lower, but they are much more important.
Analyzing 1 Day's Traffic
Here we see how each tool reports the same day's web traffic.
| Website Visitors (or Sessions) | Count |
|---|---|
| Raw Logfile | 814 |
| Urchin | 1,036 |
| Google Analytics | 379 |

With the raw logfile, I pulled all unique IP addresses. A number of these IPs came back multiple times throughout the day, presumably causing Urchin's number to be higher.

| Pageviews | Count |
|---|---|
| Raw Logfile | 9,579 |
| Urchin | 10,718 |
| Google Analytics | 1,672 |

For the logfile number, I added all requests for ".php" pages. Urchin reports .xml, .pdf, .swf, and .txt files in their pageview reports, causing the number to be higher.

| Single Page Requests (search.php) | Count |
|---|---|
| Raw Logfile | 2,441 |
| Urchin | 2,440 |
| Google Analytics | 30 |

Here, I filtered out all pageviews except the site's search page, /search.php. Both these single-page results and the previous results for all pageviews show huge discrepancies. See below for details.
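For anyone who wants to pull similar logfile numbers, here is a minimal Python sketch of the three counts described above: unique IP addresses, requests for ".php" pages, and requests for /search.php. It assumes the log is in Apache's "combined" format. The regular expression, the function name `summarize`, and the sample lines (including their date and IPs) are made up for illustration and are not the script used for this article.

```python
import re
from urllib.parse import urlsplit

# Assumed layout: Apache "combined" log format, i.e.
# IP identity user [time] "METHOD URL PROTOCOL" status size "referer" "user agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def summarize(lines):
    """Count unique IPs, .php requests, and /search.php requests."""
    ips = set()
    php_views = 0
    search_views = 0
    for line in lines:
        m = LOG_LINE.match(line)
        if m is None:
            continue  # skip lines that do not look like combined-format entries
        ips.add(m["ip"])
        path = urlsplit(m["url"]).path  # drop any ?query=string
        if path.endswith(".php"):
            php_views += 1
        if path == "/search.php":
            search_views += 1
    return {
        "unique_ips": len(ips),
        "php_pageviews": php_views,
        "search_requests": search_views,
    }


# Hypothetical sample entries, not taken from the real logfile.
sample = [
    '10.0.0.1 - - [01/Aug/2008:10:00:00 -0400] "GET /search.php?q=cms HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (Windows; U)"',
    '10.0.0.1 - - [01/Aug/2008:10:00:05 -0400] "GET /index.php HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (Windows; U)"',
    '10.0.0.2 - - [01/Aug/2008:10:01:00 -0400] "GET /feed.xml HTTP/1.1" 200 900 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
]
print(summarize(sample))
# {'unique_ips': 2, 'php_pageviews': 2, 'search_requests': 1}
```

Note that counting unique IPs only approximates visitors: several people can share one address, and one person can show up under several.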
The search.php comparison is the most telling. Both the raw logfile and Urchin report about 2,440 requests for "search.php". Why is Google Analytics only reporting 30 requests for the same page on the same day? Compared with the raw logfile's 2,441, Google seems to be under-reporting 2,411 requests.
Looking at the server log, we find exactly 2,411 requests from browsers (or User Agents) that we probably don't care about. Google Analytics filters all of these out of their reports:
- 2,317 of the requests were from the user agent "Mozilla/5.0 (compatible; Googlebot/2.1)". This is Google, spidering our page. (On a side note, this seems like an insane number of requests for one page in one day. I guess that's another issue I could look into.)
- 47 requests came from a user agent that doesn't identify itself. All of these requests came from 3 IP addresses, all resolving to the same domain, clients.your-server.de. This person (or script) probably has JavaScript turned off. I'm actually glad that Google Analytics is filtering these requests out, as they're obviously not from a user we care about. All of this user's requests are searches for "<a" or "<script", most likely a script looking for vulnerabilities.
- 44 requests came from "Twiceler-0.9 http://www.cuil.com/twiceler/robot.html". Google Analytics is filtering out requests from the new search engine, Cuil.com.
- 1 request from Yahoo/Slurp's robot
- 1 request from user agent, "Java/1.6.0_04"
- 1 request from "FeedHub MetaDataFetcher/1.0 (http://www.feedhub.com)"
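The breakdown above can be approximated on the log side by sorting each request's user agent into a bucket. The sketch below is a rough illustration only: the substring list and category labels are my own assumptions drawn from the user agents listed above, and this is not how Google Analytics itself decides what to record. It also re-checks that the listed counts add up to the 2,411 missing requests.

```python
from collections import Counter

# Substrings come from the user agents listed above; the labels are assumptions.
KNOWN_AGENTS = [
    ("Googlebot", "googlebot"),
    ("Twiceler", "cuil"),
    ("Slurp", "yahoo"),
    ("Java/", "script"),
    ("FeedHub", "feed fetcher"),
]


def classify(agent):
    """Return a rough category for a user agent string."""
    agent = agent.strip()
    if agent in ("", "-"):
        return "unidentified"  # Apache logs a missing user agent as "-"
    for needle, label in KNOWN_AGENTS:
        if needle in agent:
            return label
    return "probably human"


def tally(agents):
    """Count how many user agents fall into each category."""
    return Counter(classify(a) for a in agents)


# Counts reported above for search.php requests missing from Google Analytics.
filtered = {
    "googlebot": 2317,
    "unidentified": 47,
    "cuil": 44,
    "yahoo": 1,
    "script": 1,
    "feed fetcher": 1,
}
missing = sum(filtered.values())
print(missing)         # 2411
print(2441 - missing)  # 30, the number Google Analytics reported
```

A substring list like this goes stale quickly as new robots appear, so treat the "probably human" bucket as an upper bound.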
So Google's report of 30 requests ends up being much more meaningful than the other log analyzer's report of 2,440 requests. Everything that Google Analytics filters out is either:
- Google's own search engine spider
- Other search engine spiders
- Scripts / Feedburners
- People up to no good.
Google Analytics' focus seems a natural progression toward reporting more meaningful data, even if the numbers are lower. In the 1990s it was all about hits (how much more useless can you get?), then it was pageviews, then visitors.
Now Google seems focused on reporting real people—not scripts, robots, spiders, or search engines.
While researching this discrepancy, I did notice a few instances where Google seemed to filter out real people. By following a user's path through the actual logfile, it looked like a few legitimate requests just weren't showing up in Google Analytics. In these cases, the browser's User Agent didn't identify itself. Though extremely rare, I'm guessing these requests came from someone behind a corporate firewall or someone who doesn't accept cookies and keeps their browser in its most secure state. Again, these requests were so rare, they wouldn't have affected the report much anyway.
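Following one user's path through a logfile amounts to grouping requests by IP address and putting them in time order. Here is a minimal sketch of that idea, again assuming Apache's "combined" log format. The pattern, the function name `paths_by_ip`, and the sample entries are hypothetical and not the exact steps used for this article.

```python
import re
from collections import defaultdict
from datetime import datetime

# Only the fields needed here: client IP, timestamp, and requested URL.
ENTRY = re.compile(r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "\S+ (?P<url>\S+)')


def paths_by_ip(lines):
    """Map each IP address to the URLs it requested, in chronological order."""
    visits = defaultdict(list)
    for line in lines:
        m = ENTRY.match(line)
        if m is None:
            continue  # ignore lines that do not parse
        # Apache timestamps look like 01/Aug/2008:10:00:05 -0400
        when = datetime.strptime(m["time"], "%d/%b/%Y:%H:%M:%S %z")
        visits[m["ip"]].append((when, m["url"]))
    return {ip: [url for _, url in sorted(hits)] for ip, hits in visits.items()}


# Hypothetical entries, deliberately out of order.
log = [
    '10.0.0.1 - - [01/Aug/2008:10:00:30 -0400] "GET /contact.php HTTP/1.1" 200 100 "-" "-"',
    '10.0.0.2 - - [01/Aug/2008:10:00:10 -0400] "GET /index.php HTTP/1.1" 200 100 "-" "Mozilla/5.0"',
    '10.0.0.1 - - [01/Aug/2008:10:00:05 -0400] "GET /index.php HTTP/1.1" 200 100 "-" "-"',
]
for ip, urls in paths_by_ip(log).items():
    print(ip, " -> ".join(urls))
```

A path that looks like ordinary browsing (home page, then a few content pages, a few seconds apart) but never appears in Google Analytics is the kind of request described above. Keep in mind that an IP address is not a person: everyone behind the same corporate firewall can share one.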
I'll be happy switching to Google Analytics and believing their numbers represent real, meaningful traffic.
Comments
Lesson of the day: when reporting traffic for potential ad placement, use Urchin numbers.
Nice write-up Dave! I was just going through a similar process on Flourish for my download counter. The counter seemed a bit higher than I expected, so I looked through the web server logs. It turned out I have quite a number of requests from Googlebot, Slurp, Java and a whole host of other search engines.
It would be really nice if there was a standard phrase search engines included in the user agent to allow logging to ignore non-human users. Perhaps something like "non-interactive".
Dave - fantastic article - I forwarded it to several folks. I've always thought the web-stats packages were very squishy and this really helps put things in perspective for me. Thanks a lot!
Great article. Thanks for posting it. I've been trying to find an explanation for the discrepancy between Webalizer and Google Analytics as we switch to GA. Now I have the answers we need. Thanks.
Great Article. I've been trying to find an explanation for the discrepancy between Urchin and Google Analytics. Now I have the answers we need. Thanks.
Great article, I've always wondered about the discrepancy, but hadn't yet dug into the details of why. We're moving most of our stuff to GA and it's good to know it most closely reflects what "real people" are doing on our sites.