As you undoubtedly know already, as of 1st July 2023 Universal Analytics will be sunsetted and GA4 (Google Analytics 4) will fully take over. This blog post delves into the reasons behind GA4’s adoption of estimated totals, how Google carries out the estimation process, and what this means for users transitioning to GA4. Additionally, we explore the implications of this approach and shed light on the accuracy of future total values in Google Analytics.
Estimated values in GA4
One of the great things about Universal Analytics was that, as a user, you got so much value from Google for free. Within GA4 there seems to be a move to rebalance that cost-value proposition: driving down the computing power needed to process the huge datasets Google Analytics produces for free, while at the same time improving the performance of dashboards within the interface. One surprising way Google are doing this is to lean on estimated values rather than exact totals; in fact, in Google Analytics 4 (GA4), totals are estimated by default.
Why does GA4 use estimated totals?
There are a few reasons why GA4 uses estimated totals. First, it is more efficient to estimate totals than to count every single user or event. Second, estimation gives GA4 consistent, predictable accuracy even on very large datasets. Third, it moves the dial on performance and cost, as estimating totals allows GA4 to be more scalable, meaning it can handle more data without becoming too slow or expensive.
The cost to the user is that the values you see in reports are not the exact number of users or events. This means reconciliation of reporting is going to become more challenging as we all move across to GA4 from 1st July.
How does Google estimate the value?
Google produces the estimated value using the HLL++ (HyperLogLog++) algorithm. Rather than counting every distinct user or session, HLL++ condenses the data into a compact sketch; the size of the sketch is determined by the desired accuracy, and the sketch is then used to estimate the cardinality of the entire dataset. Google suggests that you make use of the Data API to retrieve the raw data, and there is an implementation of HLL++ available in BigQuery that you can apply to your exported data to get similar results to the totals estimated in the dashboard. Differing inputs mean the results will still not match exactly (BigQuery limits sparse precision to precision +5, while GA4 uses as much as +11), and when doing analysis outside of GA4 you must be careful how you segment the data for export, as Google limits the data that can be extracted and may return different values depending on the level of granularity. It also depends on how you configure identity: values such as total user and session counts can differ between the interface and exported data if you are using the Blended or Observed identity models rather than ‘By device only’.
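Google has not published GA4’s exact HLL++ configuration beyond the precision figures above, but the core idea is straightforward. As a rough, illustrative sketch only (this is plain HyperLogLog in Python, not Google’s production HLL++, and the `hll_estimate` name and parameters are our own for illustration):

```python
import hashlib
import math

def hll_estimate(items, p=12):
    """Estimate the number of distinct items with a basic HyperLogLog sketch.

    p is the precision: the sketch uses m = 2**p registers, and the typical
    relative error is about 1.04 / sqrt(m), however many items you feed in.
    """
    m = 2 ** p
    registers = [0] * m
    for item in items:
        # Take a 64-bit hash of the item.
        h = int.from_bytes(hashlib.sha256(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - p)               # first p bits choose a register
        rest = h & ((1 << (64 - p)) - 1)  # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - p) - rest.bit_length() + 1
        registers[idx] = max(registers[idx], rank)
    # Harmonic mean of 2**-register, scaled by the bias-correction constant.
    alpha = 0.7213 / (1 + 1.079 / m)
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    # Small-range correction: fall back to linear counting when the
    # raw estimate is low and some registers are still empty.
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        return m * math.log(m / zeros)
    return raw
```

The whole dataset is read once, but only the 2**p registers are kept, which is why the approach is so cheap at Google’s scale: memory and query cost depend on the chosen precision, not on the number of users.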
Are estimated values a good thing?
What should reassure you is that you may already have been seeing these sorts of estimated values for users in Universal Analytics. If you have turned on the Enable Users Metric in Reporting setting (off by default for standard GA accounts, on by default for GA360), then in reports as far back as 2016 you have been seeing estimated user counts with error rates of around 2%. Applying this to sessions is new and will present its own challenges, but it is not without precedent.
How accurate will my totals be on Google Analytics in the future?
So how accurate will your totals be in future? The answer is contextual: the accuracy of the estimate produced by HLL++ depends on several factors, including the size of your dataset, the sketching method used, and the level of precision Google has pegged their algorithm at. It also depends on how accurate you need it to be, and whether you are willing to invest the time and effort in exporting and analysing the data yourself.
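If you want a feel for the numbers involved, the standard HyperLogLog error bound gives a rough rule of thumb (this is the textbook formula, not a figure Google publishes for GA4):

```python
import math

def hll_standard_error(precision):
    """Typical relative error of a HyperLogLog sketch using 2**precision registers."""
    return 1.04 / math.sqrt(2 ** precision)

# Each extra bit of precision doubles the register count and so
# shrinks the typical error by a factor of sqrt(2).
```

For example, precision 11 works out to roughly 2.3% and precision 12 to roughly 1.6%, in the same ballpark as the ~2% error rates long seen for the estimated Users metric in Universal Analytics.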
If you have any questions about how GA4 will impact your reporting, or how this will affect your dashboards and reporting in the ASK BOSCO platform, please drop us an email at email@example.com