From the course: Fundamentals of Data Transformation for Data Engineering

Why SQL? Why Python? Why not Spark?

- [Instructor] In this course, we'll be concerned with two major languages for data transformation, and that is SQL, or "sequel," and pandas. But I think it's important that we take some time and discuss why we're focused on those two languages as well as what else exists and what's likely to be popular in the future. So why SQL? Well, SQL is time tested and databases have been around since 1970, and that means so has SQL. So this is probably one of the only, if not the only, programming language that people have been using consistently, or variance of consistently, for the last 50 plus years. And SQL is excellent for relational data. The way the language is defined makes it a great choice for joining, unioning, and combining data. And most data is relational today, though that's quickly changing as we'll discuss in a later exercise. But SQL also runs at scale, and that means from a thousand rows to a billion rows, you can be using SQL or a SQL variant to select and join that data with minimal switches in syntax. SQL accents Python well, sort of like a hammer and a screwdriver. You use a hammer for certain tasks like pounding a nail and a screwdriver for others. SQL is really good at certain data manipulations, as we'll discuss in this course, where Python falls short, and the opposite is true as well. Python's really great at certain things that SQL does not excel at. So lastly, SQL's a standard, and if you work as a data analyst or a data engineer, you will need to know how to write SQL, and that's really what makes it important. Everyone that works in this field knows how to use it, and it's sort of a network effect. That means that the value of understanding SQL is increased by the fact that most professionals use this language, too. So that's why we use SQL. But what about Pandas? Well, pandas is actually a subset of Python. Pandas is its own library, as I'm sure you're familiar, but it's also time tested, and it's been the tool of choice for about 15 years now, which, in Python's lifetime, is quite a while. And I think one of the underrated aspects of Python is that there's a ton of community support, and everything that you need to know about the pandas library, you'll be able to find online through Stack Overflow, the pandas documentation, or ChatGPT or other AI models since those are indexed on existing data. Pandas is also incredibly flexible, and because Python is a relatively straightforward language, that means you can do many things very simply, which is important for being able to transform data and iterate on the transformations that you write. Lastly, pandas is really well suited to handle unstructured data as we'll see in some of our upcoming lessons. Now, I would be remiss if I didn't mention the pitfalls associated with these two languages, of which there are many, but we'll focus on a couple. And it's important to remember that there's no such thing as a perfect tool, but we're trying to get as close as possible, and in data transformation, that's why we're using two languages. But with SQL, expressions can be limited. You can't really even think about SQL as a programming language. There isn't a concept of variables or modules or many of the other more complete functionalities that other programming languages like Python or JavaScript contain, and that means that it has a very narrow scope. It can also have a pretty steep learning curve. People always get tripped up by window functions, which we'll discuss a little bit in this course, but SQL can be intimidating at first, especially some of the more advanced concepts. And admittedly, it's not the easiest to read. Now, pandas has its own faults. It doesn't perform super well on big datasets. It requires a lot of memory, and, at times, it can have an odd syntax and also a steep learning curve. And you'll notice that these both can have a steep learning curve, which means that a course on data transformation might be challenging. And we're going to tackle that as best as we can. We're going to make this as approachable as possible, but it's important to note that many of these concepts, or the way that you do things in these two languages, might seem foreign at first, and that's okay, but we're going to take it step by step, and we'll tackle it together. So it's important to note what else is out there because there are some new, exciting trends in technology and data transformation, but they're mainly concerned with distributed, compute, or different languages. So frameworks like Polars, Rust, Ray, Dask, they're primarily performance-oriented and they have a lot of promise, but they're not as popular as pandas or plain Python. And for SQL, new languages and frameworks or new syntaxes, basically new databases, require different variants of SQL, and maybe even the promise of AI writing your SQL for you are on the horizon. Unstructured data sources also pose a bit of a challenge for SQL workflows, but newer languages tackle those sources really well, DuckDB being one of those as we'll discuss. And if we think about the history of programming languages, or even just go back in the last 20 years, Python has been the most dominant language for a majority of that, and while newcomers like Rust show a lot of promises for data transformation, the network effect of popularity, both within data infrastructure and also among practitioners using the language in their job, is really important for community support, learning and development, and being able to build the best data systems possible. And that's why we'll be using SQL and Python in this course.

Contents