From the course: Fundamentals of Data Transformation for Data Engineering

Codespaces and setup

- [Instructor] So here we're going to talk about how to set up the course. We'll be using GitHub Codespaces, so it should be pretty straightforward. So what is GitHub Codespaces? Well, GitHub Codespaces is a containerized environment, which means every experience is identical. You can think of a codespace as a separate virtual machine, and because it runs on Docker, everyone's environment will be the same. That means your interactions with this course can take place entirely in the browser through GitHub, regardless of which browser you're using. And if you want to run the code locally, you can: just pull the repo down and create a new virtual environment. I have all of the files in the course configured to work with either a virtual environment or Docker, but we won't be going into depth on how to set that up. We'll let you do that if that's what you desire, though. And finally, the code will always be available for you to reference. That means when you're done tinkering with this course, learning, playing, you won't have to go back and manually save it off somewhere. You'll always be able to reference the GitHub repo, and if you pull it down locally, it'll always be on your device. I really encourage you to play with the code and tinker, but also know that you'll always be able to reuse the snippets in your own personal projects, or reference them for projects you do in the future.

Next, we'll talk a little bit about the dataset we'll be using, and then we'll jump into GitHub Codespaces, where we'll actually configure the environment and you can follow along. For this course, we used a dataset from the US National Park Service API, which means we'll be focused on parks and campgrounds as relational data: Parks is a table, Campgrounds is a table. We'll also dig into alert data for some time series experience. The great thing about this dataset is that you can explore it on your own, too. Every endpoint in that API is saved off as a table in our database, and I'll show you how you can explore, select, and query that data on your own in the next video.

So, we'll jump right into a video walkthrough of how to set up GitHub Codespaces and get started with the course. This is the course GitHub repo, and everything's pretty straightforward. All of the course material is contained in this course folder, and on the front of the repo we have a README that gives an overview of the course: why we're learning what we're learning, my philosophy around learning through doing, and the course structure, in case you forgot it from the last video. There's also a high-level repo structure if any of that is confusing, as well as a getting started guide, a guide to running code, and a repo walkthrough. So this should all be pretty straightforward. To create a new codespace, we're going to click this Code button and select Create codespace on main. That's going to create the container I mentioned and pull all of these files into a virtual VS Code environment. You'll see that screen when you click the button. It should open in a new tab.
If it doesn't, check whether your browser blocks popups or is otherwise configured in a way that prevents it. Hopefully, this is the screen you'll see next, and you should notice that on the left we have all of the files that were present in the GitHub version of the course. If you click Course, you'll see the different sections of the course: an intro, our two core lessons on SQL and pandas, and an appendix on next steps and continuing your data journey.

Now, you'll notice that in the terminal there's an updateContentCommand running. This configures the environment for the course, so once you launch the codespace, you'll want to wait for that command to finish. When it finishes, you'll just see a terminal and a blank command line; I'll highlight that once it's done. Once that command is done running, the environment is entirely configured, every package you need is installed, and you should be able to jump right in.

So I'm going to show you how to run a SQL cell in this environment just to kick things off, and then we'll also talk about how to run a pandas cell. Clicking SQL, there are two folders: exercises and lessons. You'll get started with the lessons folder, and we can jump right into that first lesson, duckdb-basics. For each lesson, I've included some helpful notes to help you work through these notebooks in the future without the videos, or just to remind you exactly what's going on. I probably won't cover those in the videos, though.

This first cell is present at the beginning of every DuckDB notebook here, and it initializes the notebook in our environment. What we're doing is importing the library with import duckdb, then loading the SQL extension for Jupyter, which allows us to run SQL in our Jupyter notebooks. The percent symbol is what's called a magic command; it lets us run something in the notebook that isn't ordinary code and does something a little special. Next, we initialize our connection to the DuckDB database, which runs in memory on this virtual machine, and last, we connect to the database and import the actual data we'll be using for the code.

If I run this, we'll get a popup the first time asking which Python environment we want to use. There's only one Python environment in this container, but if you were running this locally, you'd want to create your own. So I'll select the only Python environment on the list, and you'll notice the environment connects to the kernel and then executes the cell. It should load the data in five or six seconds. If you see this message, everything worked as it should, and the count is the number of tables we have.

Navigating to the next cell, we can then interact with the data: SELECT * FROM nps_public_data.parks LIMIT 1 pulls the first row from our parks data, which we can see is a national memorial called Federal Hall. At any time, if you want to examine what other data lives in the course, you can just type DESCRIBE after loading everything else, and it will describe the dataset that's loaded into memory. You can see here we have our schema, nps_public_data, and a bunch of tables: park hours, parking lots, parks, et cetera. Once you run this command, it gives you a blueprint for what you can then query; a rough sketch of this whole flow follows below.
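If you'd like to see that init-and-query flow in one place, here's a minimal sketch using DuckDB's plain Python API rather than the course's Jupyter SQL magic. The schema and table names come from the video; the Parquet file path, and the choice of the Python API over the notebook magic, are assumptions for illustration only.

```python
# A minimal sketch of the init-and-query flow, assuming DuckDB's plain
# Python API instead of the notebook's SQL magic. The Parquet path below
# is hypothetical; the schema and table names follow the video.
import duckdb

# An in-memory database: data lives only as long as this process/kernel.
con = duckdb.connect(":memory:")

# DuckDB can read Parquet files directly; here we materialize one file as
# a table inside a schema so it can be queried like a normal database table.
con.sql("CREATE SCHEMA IF NOT EXISTS nps_public_data")
con.sql("""
    CREATE TABLE nps_public_data.parks AS
    SELECT * FROM read_parquet('data/nps/parks.parquet')  -- hypothetical path
""")

# A bare DESCRIBE (in recent DuckDB versions) lists the loaded schemas,
# tables, and columns -- the "blueprint" mentioned in the video.
print(con.sql("DESCRIBE").df())

# Pull a single row, as in the video.
print(con.sql("SELECT * FROM nps_public_data.parks LIMIT 1").df())
```

The same query pattern works for any table that DESCRIBE lists.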
For example, if I write SELECT * FROM nps_public_data.campgrounds LIMIT 1, we'll get data from the Campgrounds table, which is also loaded in our dataset. The way this works, if you look back at the earlier cell, is that there's actually a data directory in our course containing this NPS dataset, and it's made up of Parquet files; you can think of Parquet as something like a compressed CSV file. So when we run that initial command, we're loading the entire data directory into memory, and we're able to query it as though it were a database with a schema and tables. That's really cool; that's what DuckDB lets us do, and it's a big part of what makes this course's setup so powerful. So that's it for how to run a SQL cell.

If you'd like to run pandas cells, the process is just as simple, maybe even a bit more straightforward. If we head on over to the pandas section of our course lessons, the first pandas lesson where we load this dataset is Lesson One. If I open that up, we're going to import pandas as pd. Again, we have to select that kernel when we load a new notebook, and you'll notice that in this instance we're loading an individual file, not the entire database, so we'll be using the pandas function read_parquet and then selecting the data we need. You don't have to worry, because this read is already written for you in all of the lesson and exercise notebooks, so it'll be as simple as running the cell; a sketch of the pattern follows below. In this cell we're loading the nps_public_data_parks dataset and displaying the first five rows of that dataset. One command and we're all set. So that's how you'll load data using pandas and DuckDB in this course. It's really straightforward, and now it's time to get into things and start building with our dataset.
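Before you jump in, here's the minimal sketch of the pandas pattern promised above. The file path is hypothetical; the real read is already written into the lesson notebooks.

```python
# A minimal sketch of the pandas loading pattern from the lesson notebooks.
# The Parquet path is hypothetical. Note that pd.read_parquet needs a
# Parquet engine installed, such as pyarrow or fastparquet.
import pandas as pd

# Unlike the DuckDB setup, which loaded the whole data directory, this
# reads a single Parquet file into a DataFrame.
parks = pd.read_parquet("data/nps/parks.parquet")  # hypothetical path

# Display the first five rows, as in the lesson.
print(parks.head())
```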
