On the surface, this might appear to be something of a contradiction.
This approach might be surprising, but it’s a necessary strategy when dealing with the vast scale of data that Google handles daily.
In Summary
Illyes’ explanation reveals a deliberate trade-off: speed and efficiency over perfect accuracy.
Illyes begins his response:
The question was, “Why is filtered data higher than overall data on Search Console, it doesn’t make any sense.”
Basically, Bloom filters speed up lookups by predicting if something exists in a data set, but at the expense of accuracy, and the smaller the data set is, the more accurate the predictions are.”
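To make the trade-off Illyes describes more concrete, here is a minimal Python sketch of a Bloom filter. It is an illustration only and is not how Google or Search Console implements anything: the class name `BloomFilter`, the helper `false_positive_rate`, the bit-array size, the number of hash functions, and the `query-`/`absent-` keys are all assumptions made up for this example. It shows that a lookup can wrongly report an item as present, and that this happens far more often when the filter holds more items.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: fast membership checks that may give false positives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per "bit" keeps the sketch simple (not memory-efficient).
        self.bits = bytearray(num_bits)

    def _positions(self, item):
        # Derive several positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))


def false_positive_rate(num_items, num_bits=10_000, num_hashes=3, probes=2_000):
    """Fill a filter with num_items keys, then probe keys that were never added."""
    bf = BloomFilter(num_bits, num_hashes)
    for i in range(num_items):
        bf.add(f"query-{i}")
    hits = sum(bf.might_contain(f"absent-{i}") for i in range(probes))
    return hits / probes


small = false_positive_rate(500)      # lightly loaded filter
large = false_positive_rate(20_000)   # heavily loaded filter of the same size
print(f"false positives with 500 items:    {small:.2%}")
print(f"false positives with 20,000 items: {large:.2%}")
```

With the same number of bits, the lightly loaded filter should report very few false positives while the heavily loaded one reports many, which mirrors the point that a smaller data set yields more accurate predictions.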
Speed Over Accuracy: A Deliberate Trade-off
This allows faster but less accurate analysis, Illyes explains:
When you handle a large number of items in a set, and I mean billions of items, if not trillions, looking up things fast becomes super hard. This is where Bloom filters come in handy.”
In the latest installment of Google’s monthly office-hours Q&A session, a question was asked regarding the higher volume of filtered data compared to overall data in Google Search Console.
The expectation is that overall data should be more comprehensive and, therefore, more extensive than any filtered subset.
This trade-off is intentional. Google prioritizes speed over 100% accuracy and accepts minor inaccuracies in exchange for being able to analyze data rapidly.
“Since you’re looking up hashes first, it’s pretty fast, but hashing sometimes comes with data loss, either purposeful or not, and this missing data is what you’re experiencing: less data to go through means more accurate predictions about whether something exists in the main set or not.
Bloom filters allow Google to work with trillions of data points, but they sacrifice some accuracy.
Filtered data can be higher than overall data in Search Console because Google uses Bloom filters to quickly analyze vast amounts of data.
So, it’s not a mistake to see that filtered data is higher than overall data. It’s how Bloom filters work.
Yet, this isn’t what users are experiencing. What’s going on here?
Search Console & Bloom Filters