With the new year and annual reports approaching, we chose to focus on an issue that impedes data accuracy for large websites that use Google Analytics. In a nutshell: in its free version, GA limits the amount of processing it performs on top of the raw data it collects, in order to save bandwidth and processing.
Sampling, put simply, is a method of measuring only part of the data in order to infer a measurement over the entire set of data with improved efficiency. This method is quite reliable as long as the sampled set is representative of the entire population. Google Analytics uses sampling in some cases when presenting reports.
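To make the idea concrete, here is a minimal sketch of estimating a total from a sample. The population (100,000 sessions, with every 50th one converting) and the function name `estimate_total` are made up for illustration. This is not how Google Analytics implements sampling internally.

```python
import random

# Hypothetical population: 100,000 sessions, every 50th one converted.
sessions = [i % 50 == 0 for i in range(100_000)]
true_conversions = sum(sessions)


def estimate_total(population, sample_size, seed=0):
    """Estimate how many items are True from a uniform random sample."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sample = rng.sample(population, sample_size)
    # Scale the count seen in the sample up to the full population size.
    scale = len(population) / sample_size
    return sum(sample) * scale


# Look at only 10% of the sessions and extrapolate.
estimate = estimate_total(sessions, 10_000)
print("actual:", true_conversions, "estimated:", estimate)
```

The estimate lands close to the actual figure because the sample is drawn uniformly. If the sample were skewed (say, only weekend sessions), the scaled-up number could be far off.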
GA pre-aggregates the data required for all the standard reports that are accessible through the (left-hand-side) menu, so they can be generated quickly without compromising accuracy. However, in many cases you’ll want to drill down beyond the standard reports by adding secondary dimensions, applying a segment, or creating customized reports. In these cases, Google goes to the raw session data, but limits the amount of data selected to 250K pre-filtered visits in order to speed up performance. When looking at large datasets, you might see a yellow notification in the top right-hand corner saying that the report is based on a certain percentage of the data.
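As a rough back-of-the-envelope check, you can approximate the percentage you should expect to see in that notification. The sketch below assumes the 250K-visit threshold described above is applied to all visits in the selected date range. The function name `approx_sampling_rate` is hypothetical, and the real figure shown by GA may differ.

```python
def approx_sampling_rate(total_visits, threshold=250_000):
    """Approximate share of visits a non-standard report is based on.

    Assumes the report uses at most `threshold` visits out of
    `total_visits` in the selected date range.
    """
    if total_visits <= 0:
        raise ValueError("total_visits must be positive")
    return min(1.0, threshold / total_visits)


# A year with 3 million visits: the report would use roughly 8% of them.
print(f"{approx_sampling_rate(3_000_000):.1%}")

# A single month with 250K visits or fewer would not be sampled at all.
print(f"{approx_sampling_rate(250_000):.1%}")
```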
In most cases, as stated above, sampling provides a reliable proxy of the full data. However, you should start being cautious with the numbers you see in the following cases:
Go for higher precision. The first thing you can always do is double the sample base (from 250K to 500K visits) using the control icon displayed above the yellow notification (see above). This makes the report a bit slower, but the delay is definitely tolerable.
Measure shorter date ranges. When reporting over a large date range (some of you might be running your annual reports soon…), you can reduce sampling considerably by running your reports over shorter periods (e.g. one report per month) and then adding up the results in Excel or another tool.
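If you prefer a script to a spreadsheet, the splitting and adding up can be sketched as follows. The helper names `monthly_ranges` and `total_from_chunks` are hypothetical, and `fetch_metric` stands in for whatever you use to get a number for one date range (a manual export, an API call, and so on). Note that this only works for additive metrics such as visits or pageviews. Unique visitors cannot simply be summed across months, because a returning visitor would be counted more than once.

```python
import calendar
from datetime import date


def monthly_ranges(start, end):
    """Split the inclusive range [start, end] into calendar-month chunks."""
    if start > end:
        raise ValueError("start must not be after end")
    ranges = []
    cursor = start
    while cursor <= end:
        last_day = calendar.monthrange(cursor.year, cursor.month)[1]
        chunk_end = min(date(cursor.year, cursor.month, last_day), end)
        ranges.append((cursor, chunk_end))
        # Jump to the first day of the following month.
        if cursor.month == 12:
            cursor = date(cursor.year + 1, 1, 1)
        else:
            cursor = date(cursor.year, cursor.month + 1, 1)
    return ranges


def total_from_chunks(fetch_metric, start, end):
    """Sum an additive metric over per-month reports.

    `fetch_metric(chunk_start, chunk_end)` is a placeholder for your own
    way of reading one monthly figure.
    """
    return sum(fetch_metric(s, e) for s, e in monthly_ranges(start, end))


# Example: list the twelve report ranges needed for an annual report.
for chunk_start, chunk_end in monthly_ranges(date(2013, 1, 1), date(2013, 12, 31)):
    print(chunk_start, "to", chunk_end)
```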
Structure your GA account. Standard reports are unsampled for any profile. Therefore, if you have a certain segment of data that you look at often (e.g. new visitors, organic traffic), create a specific view (profile) for it. However, this will not help you with non-standard reports, which are based on raw data from the entire property (everything that’s tracked with the same UA-xxxxx-yy code). Therefore, if you have large numbers of visitors, use separate properties for your website, apps, blog, etc.
Consider using multiple trackers. In some cases you’ll want high accuracy for a subset of your data (e.g. registered users’ activity) while still being able to see it within a consolidated view of all visitor activity. In cases like this, you can use an additional tracker just for that subset. Another use case for this method: if you’re tracking your website and mobile apps under separate properties, you can use an additional tracker to send all of your data to a Universal web property for measuring cross-platform user behavior.