The Bug in Google’s Flu Trend Data

First published March 20, 2014 in Mediapost’s Search Insider

Last year, Google Flu Trends blew it. Even Google admitted it. It over predicted the occurrence of flu by a factor of almost 2:1.  Which is a good thing for the health care system, because if Google’s predictions had have been right, we would have had the worst flu season in 10 years.

Here’s how Google Flu Trends works. It monitors a set of approximately 50 million flu related terms for query volume. It then compares this against data collected from health care providers where Influenza-like Illnesses (ILI) are mentioned during a doctor’s visit. Since the tracking service was first introduced there has been a remarkably close correlation between the two, with Google’s predictions typically coming within 1 to 2 percent of the number of doctor’s visits where the flu bug is actually mentioned. The advantage of Google Flu Trends is that it is available about 2 weeks prior to the ILI data, giving a much needed head start for responsiveness during the height of flu season.

FluBut last year, Google’s estimates overshot actual ILI data by a substantial margin, effectively doubling the size of the predicted flu season.

Correlation is not Causation

This highlights a typical trap with big data – we tend to start following the numbers without remembering what is generating the numbers. Google measures what’s on people’s minds. ILI data measures what people are actually going to the doctor about. The two are highly correlated, but one doesn’t not necessarily cause the other. In 2013, for instance, Google speculated that increased media coverage might be the cause for the overinflated predictions. More news coverage would have spiked interest, but not actual occurrences of the flu.

Allowing for the Human Variable

In the case of Google Flu Trends, because it’s using a human behavior as a signal – in this case online searching for information – it’s particularly susceptible to network effects and information cascades. The problem with this is that these social signals are difficult to rope into an algorithm. Once they reach a tipping point, they can break out on their own with no sign of a rational foundation. Because Google tracks the human generated network effect data and not the underlying foundational data, it is vulnerable to these weird variables in human behavior.

Predicting the Unexpected

A recent article in Scientific American pointed out another issue with an over reliance on data models –  Google Flu Trends completely missed the non-seasonal H1N1 pandemic in 2009. Why? Algorithmically, Google wasn’t expecting it. In trying to eliminate noise from the model, they actually eliminated signal coming during an unexpected time. Models don’t do very well at predicting the unexpected.

Big Data Hubris

The author of the Scientific American piece, associate editor Larry Greenemeier, nailed another common symptom of our emerging crush on data analytics – big data hubris. We somehow think the quantitative black box will eliminate the need for more mundane data collection – say – actually tracking doctor’s visits for the flu. As I mentioned before, the biggest problem with this is that the more we rely on data, which often takes the form of arm’s length correlated data, the further we get from exploring causality. We start focusing on “what” and forget to ask “why.”

We should absolutely use all the data we have available. The fact is, Google Flu Trends is a very valuable tool for health care management. It provides a lot of answers to very pertinent questions. We just have to remember that it’s not the only answer.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.