From the course: Data Wrangling in R
What is tidy data?
- [Instructor] The goal of this course is to help you use R to transform your datasets into a consistent format known as Tidy Data. You do this through a process known as data wrangling. Data wrangling is the process of taking messy data and manipulating it into a format that's well suited for analysis. It goes by many other names. Some people call it data cleaning, data munging, or data preparation. Whatever name you choose to use, it's important to remember that this is not a one-time task. While it's true that most data projects involve a lot of data wrangling up front, data wrangling is a continuous process, and as you encounter new datasets, new problems, and new ideas during the course of your project, you'll likely return to perform some more data wrangling.
Now, the term Tidy Data describes data that's been put into a standardized format that facilitates future analytic work. Hadley Wickham, a data scientist who is one of the key developers of the Tidyverse packages for R, coined the term Tidy Data in a paper that he published in the Journal of Statistical Software back in 2014. Throughout this course, I'll refer back to the principles that Wickham outlined in this paper, as it's considered one of the most important works in the field of data wrangling. I encourage you to go back and read the paper yourself after you complete this course. You'll find that it's full of examples that help illustrate the concepts of Tidy Data. One quick word of warning: the Tidyverse is rapidly evolving, so some of the material that I cover in this course is more recent than what's covered in the paper.
Converting data from its original format into Tidy form is difficult, time-consuming work. So why would we want to spend the time and effort required to create Tidy Data? Well, there are three reasons. First, Tidy Data facilitates initial exploration and analysis.
If our data is in a standardized format, it's much easier to notice trends, anomalies, and other important features of our datasets. Second, Tidy Data improves our ability to collaborate with others. If our data is in a standard format, we can easily share it with other people, who can then begin analyzing it quickly without having to go through their own data wrangling work first. Finally, if we convert our data to Tidy format, we can take advantage of the many R packages that accept Tidy Data as input, without performing additional transformations.
That sounds great, right? But the trick is that while Tidy Data has a consistent format, you'll need to figure out how to convert your existing data into that format. Wickham summed it up best in his paper by quoting Tolstoy, who said, "Happy families are all alike, but every unhappy family is unhappy in its own way." Wickham drew the parallel to Tidy Data by saying that tidy datasets are all alike, but every messy dataset is messy in its own way. Your job in wrangling data is to develop an understanding of your unique datasets and figure out how to use the data manipulation tools in R to properly structure them as Tidy Data.
Once you've done that, a whole world of data analysis tools becomes available to you. Tidy Data unlocks a set of tools known as the Tidyverse. The Tidyverse consists of a set of R packages that work together to transform, analyze, and visualize Tidy Data. The Tidyverse packages can easily share data with each other, and they allow you to quickly take advantage of the power of R.