From the course: Learning Data Science: Understanding the Basics
Start out with descriptive statistics
- Data science teams will spend most of their time collecting, scrubbing, and storing data. Then they use the data to ask interesting questions. They create reports using statistics and math to see if they get some new insights. Statistics is a very interesting field. To participate in a data science team, you'll need some basic understanding of the language. One thing to remember is that statistics are tools that help tell a story. But they're not, in themselves, the end of the story. The best way to tell how much of a story you're not getting is by pushing back when things don't seem right.
My son once told me a great joke about this. It shows how teams can use statistics to tell stories. He started out by saying, "Do you know why you don't ever see elephants hiding in trees?" I shrugged and said no. He said, "Because they're really good at it." Try to remember this joke when you look at your reports. People usually think of statistics as concrete mathematics. Who would question whether or not two plus two equals four? In truth, statistics is much more like storytelling. Like any story, it can be filled with facts, fictions, and fantasy. You can hide some pretty big elephants if you don't know where to look.
One place you'll see this is with politics. One representative may say that over the last four years, the voters' average salary has gone up $5,000. People will clap and cheer. Then the challenger will say that they shouldn't be clapping because, over the last four years, the typical middle-class family has come to earn $10,000 less. So who's telling the truth? Well, both of them are. They're just using statistics to tell a different story. One story talks about prosperity, and the other talks about failure. Both of them are true, and yet neither of them is telling the whole truth. You have to look for the elephants in each of these stories. In this case, each representative is using descriptive statistics.
They're trying to describe how all the voters are doing without having to talk about each family. They're creating a story of a typical family. One representative uses something called the mean, which is essentially an average. They add up the income of all the families, then divide it by the total number of families. This is one of the most useful and popular descriptive statistics. You can use it with grade point averages, sports statistics, estimated travel times, and investments. In this example, let's say that the representative added up the income of every family. Then they divided it by the total number of families. Sure enough, on average, families earned about $5,000 more.
But hold on, the mean is not the only way to describe a typical family. The competing representative has another way. They used the median family income. The median describes what a family in the middle of a distribution earns. To find this, you take all of the families and then rank them from lowest to highest. Then you number them from top to bottom. Then you find the number in the middle by dividing the ranking in half. The family in the middle has the median income.
When you see this, remember to look for that elephant. When there's a big gap between the median and the mean, it usually means that your data is skewed. In this case, imagine that a few families are extremely wealthy. In the last few years, their income may have gone up substantially. Maybe this accounts for millions of added dollars. These families skew the data because there's a big chunk of money at the top. That would increase the mean but wouldn't impact the median. In the mean, their income would be added up like everyone else's and factored into the average. In the median, they would just be at the top of the ranking. Since the number of families didn't change, the gains at the top wouldn't move the family at the middle point. You see this challenge with the median and the mean in other ways as well.
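Here's a small sketch of how both representatives could be right at the same time. The income figures are made up for illustration (they are not from any real data set), and the variable names are just placeholders. It uses Python's standard `statistics` module to compute the mean and the median before and after the four years.

```python
from statistics import mean, median

# Hypothetical annual incomes for nine families (illustrative numbers only).
incomes_before = [30_000, 35_000, 40_000, 45_000, 50_000,
                  55_000, 60_000, 65_000, 520_000]

# Four years later: eight families each earn $10,000 less,
# while the one very wealthy family earns $125,000 more.
incomes_after = [20_000, 25_000, 30_000, 35_000, 40_000,
                 45_000, 50_000, 55_000, 645_000]

# The mean adds up every income, so the big gain at the top pulls it up.
mean_change = mean(incomes_after) - mean(incomes_before)

# The median only looks at the family in the middle of the ranking.
median_change = median(incomes_after) - median(incomes_before)

print(f"Change in mean income:   {mean_change:+,}")
print(f"Change in median income: {median_change:+,}")
```

With these assumed numbers, the mean goes up while the median goes down. That gap is the elephant: one large income at the top is doing most of the work in the average.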
If there are two people standing in a room, their mean height might be just under six feet. If a basketball player walks into the room, the mean height might jump by several inches. The median height would stay roughly the same, but the group would skew tall.
On your data science team, don't be afraid to ask questions when you see stories using statistics. Make sure you review how certain claims are made. Also, try to make sure that your reports use different ways to describe the data. Look out for that elephant. Remember that statistics can tell many different stories.
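One way to put that advice into practice is to have a report show more than one description of the same data. The sketch below is only an illustration: the `describe` helper and the heights (in inches) are assumptions made up for the room example, not part of any particular reporting tool.

```python
from statistics import mean, median

def describe(values):
    """Summarize the data two ways and report the gap between them."""
    m = mean(values)
    md = median(values)
    # A large gap between mean and median hints that the data is skewed.
    return {"mean": m, "median": md, "gap": m - md}

# Hypothetical heights in inches: two people just under six feet (72 inches).
room = [70, 71]

# A seven-foot basketball player (84 inches) walks in.
room_with_player = room + [84]

before = describe(room)
after = describe(room_with_player)

print("Before:", before)
print("After: ", after)
```

With these assumed heights, the mean moves by several inches while the median barely changes. Showing both numbers side by side makes the skew easy to spot.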