iMarc | Interactive Media Architects

Inconsistent Web Analytics Numbers: Google vs. The World

by Dave Tufts - December 22, 2008 / 3:58pm

Over the past 11 years, iMarc has used a number of web analytics tools. Whether FunnelWeb, Webalizer, Urchin, Mint, or Google Analytics, the goal is always to understand how people use the web and make optimizations based on that usage.

Recently, we've been recommending Google Analytics. Of course, Google Analytics has its limitations and problems; most notably, it requires the end user to have JavaScript enabled and to accept cookies. That said, Google's ease of use, especially when compared to other reporting software, made it our choice for most clients.

Once we started moving clients' sites from Webalizer and Urchin to Google Analytics, we were amazed at the discrepancies in traffic numbers. Google's numbers were much lower: sometimes half, 1/5th, even 1/10th of the traffic that Urchin was reporting. Luckily, though the numbers were inconsistent between software packages, traffic trends were almost identical. Moral of that story: don't change reporting tools.

However, once committed to switching from a log-based analyzer like Urchin to Google Analytics, we were determined to learn more about what was causing this discrepancy.

We looked at one of our websites, picked a random day, and compared the results from Google Analytics to Urchin/Webalizer (Urchin and Webalizer are different programs, but their reporting numbers are almost identical). Since we have access to the raw Apache logfile, we looked at that as well.

As it turns out, Urchin's results are close to what the server's logfile shows, but Google Analytics is by far the best gauge of true, meaningful traffic. Google Analytics' numbers may be dramatically lower, but they are much more important.

Analyzing 1 Day's Traffic

Here we see how each tool reports the same day's web traffic.

Website Visitors (or Sessions)

  • Raw Logfile: 814
  • Urchin: 1,036
  • Google Analytics: 379

With the raw logfile, I pulled all unique IP addresses. A number of these IPs came back multiple times throughout the day, presumably causing Urchin's number to be higher.

Pageviews

  • Raw Logfile: 9,579
  • Urchin: 10,718
  • Google Analytics: 1,672

For the logfile number, I added up all requests for ".php" pages. Urchin includes .xml, .pdf, .swf, and .txt files in its pageview reports, causing its number to be higher.

Single Page Requests (search.php)

  • Raw Logfile: 2,441
  • Urchin: 2,440
  • Google Analytics: 30

Here, I filtered out all pageviews except the site's search page, /search.php. Both these single-page results and the previous results for all pageviews show huge discrepancies. See below for details.
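For anyone who wants to pull the same three logfile numbers themselves, here is a minimal sketch in Python. It assumes the server writes Apache's "combined" log format; the sample lines, the regular expression, and the function names are all made up for illustration and are not the commands I actually ran.

```python
import re
from urllib.parse import urlsplit

# Assumption: Apache "combined" log format. Adjust if your LogFormat differs.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) \S+'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

def summarize(lines, page="/search.php"):
    """Count unique IPs, requests for .php pages, and requests for one page."""
    ips = set()
    php_pageviews = 0
    page_requests = 0
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue  # skip lines that are not in the expected format
        ips.add(entry["ip"])
        path = urlsplit(entry["url"]).path  # drop the query string
        if path.endswith(".php"):
            php_pageviews += 1
        if path == page:
            page_requests += 1
    return {
        "unique_ips": len(ips),
        "php_pageviews": php_pageviews,
        "page_requests": page_requests,
    }

# Hypothetical sample lines (documentation IP addresses, invented requests).
SAMPLE_LOG = [
    '192.0.2.10 - - [22/Dec/2008:10:00:00 -0500] "GET /index.php HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (Windows; U)"',
    '192.0.2.10 - - [22/Dec/2008:10:01:00 -0500] "GET /search.php?q=logs HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (Windows; U)"',
    '192.0.2.20 - - [22/Dec/2008:10:02:00 -0500] "GET /search.php?q=a HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
    '192.0.2.20 - - [22/Dec/2008:10:03:00 -0500] "GET /feed.xml HTTP/1.1" 200 900 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
]

if __name__ == "__main__":
    print(summarize(SAMPLE_LOG))
```

Note that counting unique IPs is only a rough stand-in for "visitors": several people can share one address, and one person can show up under several.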

This last comparison is the most telling. Both the raw logfile and Urchin report about 2,440 requests for "search.php". Why is Google Analytics only reporting 30 requests for the same page on the same day? Google seems to be under-reporting 2,411 requests.

Looking at the server log, we find exactly 2,411 requests from browsers (or User Agents) that we probably don't care about. Google Analytics filters all of these out of its reports:

  • 2,317 of the requests were from the user agent "Mozilla/5.0 (compatible; Googlebot/2.1)". This is Google, spidering our page. (On a side note, this seems like an insane number of requests for one page in one day. I guess that's another issue I could look into.)
  • 47 requests came from a user agent that doesn't identify itself. All of these requests came from 3 IP addresses, all resolving to the same domain, clients.your-server.de. This person (or script) probably has JavaScript turned off. I'm actually glad that Google Analytics is filtering these requests out, as they're obviously not from a user we care about. All of this user's requests are searches for "<a" or "<script", most likely a script looking for vulnerabilities.
  • 44 requests came from "Twiceler-0.9 http://www.cuil.com/twiceler/robot.html". Google Analytics is filtering out requests from the new search engine, Cuil.com.
  • 1 request from Yahoo/Slurp's robot
  • 1 request from the user agent "Java/1.6.0_04"
  • 1 request from "FeedHub MetaDataFetcher/1.0 (http://www.feedhub.com)"
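A breakdown like the one above can be produced by tallying the User-Agent field for a single page. The following is a rough sketch, again assuming Apache's "combined" log format, with invented sample lines and hypothetical function names. It also double-checks that the counts listed above add up to the gap between the logfile (2,441) and Google Analytics (30).

```python
import re
from collections import Counter

# Assumption: "combined" log format, where the last quoted field on the
# line is the User-Agent. Only GET, POST, and HEAD requests are considered.
REQUEST_RE = re.compile(
    r'"(?:GET|POST|HEAD) (?P<url>\S+)[^"]*".*"(?P<agent>[^"]*)"\s*$'
)

def agents_for_page(lines, page="/search.php"):
    """Count requests for one page, grouped by User-Agent string."""
    counts = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if not match:
            continue
        path = match.group("url").split("?", 1)[0]  # ignore the query string
        if path != page:
            continue
        agent = match.group("agent")
        # Apache writes "-" when the client sent no User-Agent header.
        counts[agent if agent not in ("", "-") else "(no user agent)"] += 1
    return counts

# The per-agent counts reported in this article for /search.php.
ARTICLE_BREAKDOWN = {
    "Googlebot": 2317,
    "unidentified": 47,
    "Twiceler": 44,
    "Yahoo Slurp": 1,
    "Java": 1,
    "FeedHub": 1,
}
MISSING_FROM_GOOGLE = 2441 - 30  # raw logfile minus Google Analytics
assert sum(ARTICLE_BREAKDOWN.values()) == MISSING_FROM_GOOGLE

# Hypothetical sample lines for trying the function out.
SAMPLE_LOG = [
    '192.0.2.1 - - [22/Dec/2008:09:00:00 -0500] "GET /search.php?q=cms HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
    '192.0.2.1 - - [22/Dec/2008:09:00:05 -0500] "GET /search.php?q=php HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
    '192.0.2.2 - - [22/Dec/2008:09:01:00 -0500] "GET /search.php?q=%3Cscript HTTP/1.0" 200 2048 "-" "-"',
    '192.0.2.3 - - [22/Dec/2008:09:02:00 -0500] "GET /index.php HTTP/1.1" 200 5120 "-" "Java/1.6.0_04"',
]

if __name__ == "__main__":
    for agent, count in agents_for_page(SAMPLE_LOG).most_common():
        print(count, agent)
```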

So Google's report of 30 requests ends up being much more meaningful than the log analyzers' report of 2,440 requests. Everything that Google Analytics filters out is one of the following:

  • Google's own search engine spider
  • Other search engine spiders
  • Scripts / feed fetchers
  • People up to no good.
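If you are stuck with a log-based tool and want numbers closer to Google's, one option is to sort requests into those same buckets by user agent. The sketch below is only a heuristic: the marker list is an assumption built from the agents that turned up in this one logfile, the function name is hypothetical, and it is not how Google Analytics itself decides what to count (its exclusions more likely follow from the JavaScript and cookie requirement mentioned earlier).

```python
# Assumption: substrings taken from the user agents seen in this article's
# logfile. A real list would need to be much longer and kept up to date.
BOT_MARKERS = ("googlebot", "slurp", "twiceler", "feedhub", "java/")

def classify_agent(agent):
    """Return 'unidentified', 'robot', or 'browser' for a User-Agent string."""
    cleaned = (agent or "").strip()
    if cleaned in ("", "-"):
        return "unidentified"  # client sent no User-Agent header
    lowered = cleaned.lower()
    if any(marker in lowered for marker in BOT_MARKERS):
        return "robot"
    return "browser"

if __name__ == "__main__":
    examples = [
        "Mozilla/5.0 (compatible; Googlebot/2.1)",
        "Twiceler-0.9 http://www.cuil.com/twiceler/robot.html",
        "Java/1.6.0_04",
        "FeedHub MetaDataFetcher/1.0 (http://www.feedhub.com)",
        "-",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) Firefox/3.0",
    ]
    for agent in examples:
        print(classify_agent(agent), "<-", agent)
```

Anything that lands in "browser" is simply "not on the list", so a script that fakes a browser's user agent will still slip through.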

Google Analytics' focus seems like a natural progression toward reporting more meaningful data, even if the numbers are lower. In the 1990s it was all about hits (how much more useless can you get?), then it was pageviews, then visitors.

Now Google seems focused on reporting real people—not scripts, robots, spiders, or search engines.

While researching this discrepancy, I did notice a few instances where Google seemed to filter out real people. By following a user's path through the actual logfile, it looked like a few legitimate requests just weren't showing up in Google Analytics. In these cases, the browser's User Agent didn't identify itself. Though extremely rare, I'm guessing these requests came from someone behind a corporate firewall or someone who doesn't accept cookies and keeps their browser in its most secure state. Again, these requests were so rare, they wouldn't have affected the report much anyway.
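Following a user's path through the logfile amounts to grouping requests by IP address and reading them in order. Here is a minimal sketch of that idea; the regular expression assumes Apache's common or combined log format, the sample lines are invented, and the function name is hypothetical. An IP address is only an approximation of a single person.

```python
import re
from collections import defaultdict

# Assumption: lines start with "IP ident user [timestamp] "METHOD URL ...".
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "\S+ (?P<url>\S+)'
)

def paths_by_ip(lines):
    """Map each IP address to the (timestamp, URL) pairs it requested, in log order."""
    visits = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            visits[match.group("ip")].append(
                (match.group("time"), match.group("url"))
            )
    return dict(visits)

# Hypothetical sample lines; Apache appends entries as requests complete,
# so log order is roughly chronological.
SAMPLE_LOG = [
    '192.0.2.10 - - [22/Dec/2008:10:00:00 -0500] "GET /index.php HTTP/1.1" 200 5120 "-" "-"',
    '192.0.2.20 - - [22/Dec/2008:10:00:30 -0500] "GET /search.php?q=a HTTP/1.1" 200 2048 "-" "-"',
    '192.0.2.10 - - [22/Dec/2008:10:01:00 -0500] "GET /about.php HTTP/1.1" 200 4096 "-" "-"',
    '192.0.2.10 - - [22/Dec/2008:10:02:00 -0500] "GET /contact.php HTTP/1.1" 200 3072 "-" "-"',
]

if __name__ == "__main__":
    for ip, requests in paths_by_ip(SAMPLE_LOG).items():
        print(ip)
        for timestamp, url in requests:
            print("   ", timestamp, url)
```

Comparing a path like this against the pages Google Analytics recorded for the same time window is one way to spot legitimate requests that went missing.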

I'll be happy switching to Google Analytics and believing their numbers represent real, meaningful traffic.


5 Comments

by Christian Madden
on December 22, 2008 / 8:58pm
Great article, I've always wondered about the discrepancy, but hadn't yet dug into the details of why. We're moving most of our stuff to GA and it's good to know it most closely reflects what "real people" are doing on our sites.

by Nick
on December 23, 2008 / 10:37am
Lesson of the day: when reporting traffic for potential ad placement, use Urchin numbers.

by Will Bond
on December 23, 2008 / 7:58pm
Nice write-up Dave! I was just going through a similar process on Flourish for my download counter. The counter seemed a bit higher than I expected, so I looked through the web server logs. It turned out I have quite a number of requests from Googlebot, Slurp, Java and a whole host of other search engines.

It would be really nice if there was a standard phrase search engines included in the user agent to allow logging to ignore non-human users. Perhaps something like "non-interactive".

by Ryan Capers
on December 30, 2008 / 2:58pm
Dave - fantastic article - I forwarded it to several folks. I've always thought the web-stats packages were very squishy and this really helps put things in perspective for me. Thanks a lot!

by Jim Samuel
on June 23, 2009 / 1:19pm
Great article. Thanks for posting it. I've been trying to find an explanation for the discrepancy between Webalizer and Google Analytics as we switch to GA. Now I have the answers we need. Thanks.

About the Author

Dave Tufts, Vice President of Technology
I help people build websites.
I have two daughters.
I'd rather be gardening.