From the course: Learning Data Science: Understanding the Basics
See how correlation does not imply causation
- Correlation is a great tool. It will help you see relationships that you might not otherwise see. Yet there's a flip side. You have to ask whether one of the things you're looking at actually caused the other. As a general rule, correlation doesn't imply causation. That means that a relationship between two things might be driven by a third thing that isn't part of your analysis. It's a big challenge for data science teams to figure out causality. You don't want to create relationships that don't exist.
Think of it this way. I grew up in one of the colder areas of the country. When my parents got older, they moved to southern Florida. They now happily live in a sunny retirement community called The Vistas of Boca Lago. Every few months our family goes down to visit. Yet, statistically, their community is one of the most dangerous places on Earth. Every time we visit, there are people being hospitalized or worse. There's a strong correlation between their community and death or severe injury.
You'd think that because of this I'd never visit my parents. It sounds like the opening scene of every first-person shooter video game. Yet I see past this correlation. We visit them often and it feels perfectly safe. That's because correlation doesn't imply causation. The true cause is that the median age is much higher. Older people, like those in a retirement community, have a much higher probability of injury or death. If you looked only at this correlation, you'd think that they lived in a war zone. You'd never imagine them peacefully playing Mahjong by the pool.
Think about how your data science team might also apply these concepts. Let's go back to our running shoe website. Imagine that the team identified that there was a big increase in sales in January. There's a strong correlation between January and the number of people buying new shoes. The team gets together to understand the cause. They ask some interesting questions.
Do people have more money in January? Why are more people running during the coldest months? Are these first-time runners motivated by their New Year's resolutions? Are they new customers? What kind of shoes are they buying? The team discusses the questions and decides to create reports. The reports suggest that most of these customers are new customers buying expensive shoes. Because of these reports, the team feels confident that the cause of the new sales is that new customers have more money in January. Maybe they received gift cards or credit from other stores.
The following year the team decides to take advantage of this supposed cause. In December they offer holiday gift cards. They also send promotions to last year's new customers. A few months later the team looks over the data. They find that their promotions and discounts had no impact. Roughly the same number of people bought the same number of shoes. It seems that having more money wasn't the cause of the correlation.
The data science team goes back to their original questions and runs a few more reports. They find that all the new sales for both years came from new customers and first-time runners. Why would there be a burst of new customers buying expensive running shoes during the coldest months? The team thinks about it and considers that the reason might be behavioral. They pose a new question. Are all the new customers people who are trying to get in shape because of a New Year's resolution? The next year they decide to create a new promotion. It's geared around New Year's resolutions. They send out a mailer that says, "Do you want to keep your New Year's resolution?" It offers free running guides and fitness trackers as a way to keep people interested throughout the year.
Correlation and causation are key challenges for most data science teams. There's real danger that you can create false relationships. In statistics, this is called a spurious correlation. As you can see, finding the real cause will give you much more value. The best way to avoid spurious correlations is by following the scientific method. Remember to ask good questions and be clear-headed about your results.
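To make the retirement community example a little more concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the population size, the share of older people, and the probabilities of living in the community or being hospitalized are assumptions, not real data about any community. In this toy model, hospitalization depends only on age, never on where someone lives. Even so, the community looks dangerous until you compare people of similar age.

```python
import random


def simulate(n=200_000, seed=0):
    """Build a synthetic population of (older, in_community, hospitalized) tuples.

    All probabilities below are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        older = rng.random() < 0.30
        # Older people are far more likely to live in the retirement community.
        in_community = rng.random() < (0.50 if older else 0.02)
        # Hospitalization depends on age only, not on the community.
        hospitalized = rng.random() < (0.20 if older else 0.02)
        people.append((older, in_community, hospitalized))
    return people


def rate(people):
    """Share of people in the group who were hospitalized."""
    return sum(h for _, _, h in people) / len(people)


def gap(people):
    """Hospitalization rate inside the community minus the rate outside it."""
    inside = [p for p in people if p[1]]
    outside = [p for p in people if not p[1]]
    return rate(inside) - rate(outside)


people = simulate()

# Ignoring age, the community appears far riskier than everywhere else.
overall_gap = gap(people)

# Comparing like with like (same age group), the difference nearly disappears.
older_gap = gap([p for p in people if p[0]])
younger_gap = gap([p for p in people if not p[0]])

print(f"Gap ignoring age:        {overall_gap:.3f}")
print(f"Gap among older people:   {older_gap:.3f}")
print(f"Gap among younger people: {younger_gap:.3f}")
```

The first number should come out clearly positive, while the two within-age numbers should sit close to zero. That's the third thing at work: age drives both where people live and how often they're hospitalized, so the community itself only looks like the cause.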