Basic site analytics doesn't tally with Google data

After being stumped by an earlier question: SO google-analytics-domain-data-without-filtering

I've been experimenting with a very basic analytics system of my own.

MySQL table:

hit_id, subsite_id, timestamp, ip, url

The subsite_id lets me drill down to a folder (as explained in the previous question).

I can now get the following metrics:

  • Page Views - Grouped by subsite_id and date
  • Unique Page Views - Grouped by subsite_id, date, url, IP (not necessarily how Google does it!)
  • The usual "most visited page", "likely time to visit" etc etc.
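The two grouped metrics above can be sketched as plain SQL. This is a minimal, hypothetical example using SQLite as a stand-in for MySQL; the table and column names follow the schema above, and the sample rows are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE hits (
        hit_id     INTEGER PRIMARY KEY,
        subsite_id INTEGER,
        timestamp  TEXT,   -- ISO date/time of the request
        ip         TEXT,
        url        TEXT
    )
""")
conn.executemany(
    "INSERT INTO hits (subsite_id, timestamp, ip, url) VALUES (?, ?, ?, ?)",
    [
        (1, "2009-06-01 10:00:00", "1.2.3.4", "/a"),
        (1, "2009-06-01 10:05:00", "1.2.3.4", "/a"),  # repeat IP+url: same unique view
        (1, "2009-06-01 11:00:00", "5.6.7.8", "/a"),
        (2, "2009-06-02 09:00:00", "1.2.3.4", "/b"),
    ],
)

# Page views: one row per hit, grouped by subsite and date
page_views = conn.execute("""
    SELECT subsite_id, date(timestamp) AS day, COUNT(*) AS views
    FROM hits
    GROUP BY subsite_id, day
    ORDER BY subsite_id, day
""").fetchall()

# Unique page views: collapse repeat hits from the same IP
# on the same url and day before counting
unique_views = conn.execute("""
    SELECT subsite_id, day, COUNT(*) AS unique_views
    FROM (SELECT DISTINCT subsite_id, date(timestamp) AS day, url, ip FROM hits)
    GROUP BY subsite_id, day
    ORDER BY subsite_id, day
""").fetchall()

print(page_views)    # [(1, '2009-06-01', 3), (2, '2009-06-02', 1)]
print(unique_views)  # [(1, '2009-06-01', 2), (2, '2009-06-02', 1)]
```

The DISTINCT subquery is what makes the second metric a "unique" count; as noted, Google may deduplicate differently (e.g., by cookie rather than IP).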

I've now compared my data to that in Google Analytics and found that Google has lower values for each metric, i.e., my own setup is counting more hits than Google.

So I've started discounting IPs from various web crawlers: Google, Yahoo & Dotbot so far.

Short Questions:

  1. Is it worth me collating a list of all major crawlers to discount, and is any such list likely to change regularly?
  2. Are there any other obvious filters that Google will be applying to GA data?
  3. What other data would you collect that might be of use further down the line?
  4. What variables does Google use to work out entrance search keywords to a site?

The data is only going to be used internally for our own "subsite ranking system", but I would like to show my users some basic data (page views, most popular pages, etc.) for their reference.

Best answer

Under-reporting by the client-side setup versus the server-side one seems to be the usual outcome of these comparisons.

Here's how I've tried to reconcile the disparity when I've come across these studies:

Data Sources recorded in server-side collection but not client-side:

  • hits from mobile devices that don't support JavaScript (this is probably a significant source of disparity between the two collection techniques--e.g., a Jan '07 comScore study showed that 19% of UK Internet users access the Internet from a mobile device)

  • hits from spiders, bots (which you mentioned already)

Data Sources/Events that server-side collection tends to record with greater fidelity (far fewer false negatives) compared with JavaScript page tags:

  • hits from users behind firewalls, particularly corporate firewalls--firewalls can block the page tag, plus some are configured to reject/delete cookies.

  • hits from users who have disabled JavaScript in their browsers--about five percent, according to W3C data

  • hits from users who exit the page before it loads. Again, this is a larger source of disparity than you might think. The most frequently cited study on this was conducted by Stone Temple Consulting: two otherwise-identical sites ran the same web analytics system, differing only in that the JS tracking code was placed at the bottom of the pages on one site and at the top of the pages on the other--the difference in unique visitor traffic between them was 4.3%.


FWIW, here's the scheme I use to remove/identify spiders, bots, etc.:

  1. monitor requests for our robots.txt file: then of course filter all other requests from the same IP address + user agent (not all spiders will request robots.txt of course, but with minuscule error, any request for this resource is probably a bot).

  2. compare user agents and IP addresses against published lists: iab.net and user-agents.org publish the two lists that seem to be the most widely used for this purpose

  3. pattern analysis: nothing sophisticated here; we look at (i) page views as a function of time (i.e., clicking a lot of links with ~200 msec on each page is probative); (ii) the path by which the "user" traverses our site--is it systematic and complete, or nearly so (like following a back-tracking algorithm); and (iii) precisely-timed visits (e.g., 3 am each day).
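The three rules above could be sketched roughly as follows. This is an illustrative toy, not a vetted implementation: the user-agent substrings, the 0.5-second threshold, and the function name are all assumptions for the sake of the example.

```python
from datetime import datetime

# Substrings of known crawler user agents -- an illustrative sample only;
# a production list would come from sources like iab.net or user-agents.org
KNOWN_BOT_AGENTS = ("googlebot", "slurp", "dotbot", "bingbot")

# Rule 1: any (ip, user_agent) pair that ever requested robots.txt is
# almost certainly a bot, so remember it and filter its other requests
robots_requesters = set()

def is_bot(ip, user_agent, url, recent_timestamps):
    """Classify one request. recent_timestamps is this visitor's last
    few page-view datetimes, oldest first (may be empty)."""
    ua = user_agent.lower()
    if url == "/robots.txt":
        robots_requesters.add((ip, ua))
        return True
    if (ip, ua) in robots_requesters:               # rule 1
        return True
    if any(bot in ua for bot in KNOWN_BOT_AGENTS):  # rule 2
        return True
    # Rule 3 (crude pattern check): five or more page views all spaced
    # under half a second apart is not plausible human reading behaviour
    if len(recent_timestamps) >= 5:
        gaps = [
            (b - a).total_seconds()
            for a, b in zip(recent_timestamps, recent_timestamps[1:])
        ]
        if max(gaps) < 0.5:
            return True
    return False
```

A request flagged by any rule would simply be excluded from the metrics queries; note that rule 1 is stateful, so the robots.txt fetch must be seen before later hits from the same IP + user agent can be filtered.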

Other answers

Lots of people block Google Analytics for privacy reasons.

The biggest reasons are that users have to have JavaScript enabled and load the entire page, as the tracking code is often in the footer. AWStats and other server-side solutions like yours will get everything. Plus, Analytics does a really good job of identifying bots and scrapers.




