📝 We surveyed 287 data practitioners across multiple regions, roles, and perspectives to explore how software engineering principles, GenAI, and organizational requirements are shaping data quality management. In this 8-minute read, discover:
* how teams are managing data quality today
* whether GenAI will disrupt data quality workflows and tooling
* the capabilities that practitioners say are essential to enhance data quality
Curious to see what your peers think about the future of data quality with AI? 👉 Read the survey findings here: https://lnkd.in/e_sDVah5
Thank you to everyone who participated in our survey!
Soda
Software Development
Brussels, Brussels 8,744 followers
Test and deliver data that everyone can trust.
About us
Say goodbye to data issues and hello to trust with Soda, the data quality platform built for the modern data stack. Our mission is simple: to help data teams catch, prevent, and resolve data issues before they wreak havoc downstream. With Soda, you can embed powerful data quality checks right into your data stack and systems. Ensure that the data you deliver is accurate, reliable, and trustworthy. And thanks to SodaGPT, our groundbreaking generative AI for data quality, creating production-ready checks has never been easier. Simply ask Soda in plain English, and it will translate your queries into SodaCL, a user-friendly language for data quality. Join the over 100 data-driven organizations worldwide, including HelloFresh, Group 1001, Lululemon, Panasonic, Zefr, and Zendesk, that rely on Soda to power their data quality. From preventing bad data merges to aligning with data consumers and integrating seamlessly into your data stack, Soda will enable your team to test and deliver data that everyone can trust.
- Website
- https://soda.io/
- Industry
- Software Development
- Company size
- 11-50 employees
- Headquarters
- Brussels, Brussels
- Type
- Privately Held
- Founded
- 2018
- Specialties
- data quality, data management, data science, data engineering, data monitoring, data observability, data testing, data reliability, data product management, data analytics, data products, data mesh, data, and data pipeline management
Products
Soda Data Quality
Data Quality Software
Soda’s mission is to empower and incentivize everyone in your organization to share and use reliable, high-quality data. Soda delivers end-to-end data quality management to detect, analyze, and prevent data issues. It's easy to integrate the Soda framework into your data stack: leverage our extensive Python and REST APIs to add data quality tests to your pipelines, avoid merging bad-quality data into production, and prevent downstream issues. For any organization that is building a domain-oriented, decentralized data platform to drive data cultural change, Soda bridges the gap between data producers and data consumers while increasing accountability and ownership, ensuring that the right people are involved at the right time. Over 100 organizations, including American Family Insurance, CarTrawler, Group 1001, HelloFresh, Lululemon, and Panasonic, use Soda to guarantee the accuracy, validity, and consistency of data at every stage of its lifecycle.
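For a flavor of the Python integration mentioned above, here is a minimal, hedged sketch of a programmatic Soda Core scan. The data source name, configuration file path, and dataset/column names are illustrative assumptions, not details from the product description.

```python
# Minimal sketch of a programmatic Soda Core scan (soda-core Python API).
# Data source name, config path, and dataset/column names are assumptions.
from soda.scan import Scan

scan = Scan()
scan.set_data_source_name("warehouse")                  # assumed data source name
scan.add_configuration_yaml_file("configuration.yml")   # assumed connection config
scan.add_sodacl_yaml_str("""
checks for orders:
  - row_count > 0
  - missing_count(order_id) = 0
  - duplicate_count(order_id) = 0
""")

scan.execute()
# Fail the pipeline step if any check fails, so bad data never reaches production
scan.assert_no_checks_fail()
```

Run as a step in CI or inside a pipeline, the failed assertion blocks the merge or deployment of bad-quality data before it causes downstream issues.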
Locations
- Primary: Brussels, Brussels 1000, BE
- New York, NY, US
Updates
-
👋 Calling all Data Engineers: learn how to integrate data quality checks into both development and production pipelines using Dagster in this technical showcase. 🖥 Our Customer Engineer Hazem El-Dabaa will demonstrate how to execute several Soda Checks Language (SodaCL) checks at multiple points within your Dagster pipeline. ➡ Key takeaways include how to:
* implement SodaCL checks to ensure data quality at various stages of your pipeline
* automate and streamline your ETL process with Dagster and dbt
* deliver trusted data for downstream analytics, minimizing data quality issues in production
A minimal sketch of this pattern follows below. Join on LinkedIn Live or register here for Zoom: https://lnkd.in/eyYQc7G5
How to Test Data Quality in a Dagster Pipeline
www.linkedin.com
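As a taste of the pattern the showcase covers, here is a minimal, hedged sketch of a Dagster job that runs SodaCL checks between ingestion and transformation. The op names, data source, and checks are illustrative assumptions, not the demo's actual code.

```python
# Illustrative sketch: a Soda Core scan as a quality gate inside a Dagster job.
# Op names, data source name, and checks are assumptions.
from dagster import Failure, job, op
from soda.scan import Scan


@op
def ingest_orders() -> str:
    # Placeholder: land raw data in the warehouse and return the dataset name
    return "raw_orders"


@op
def soda_checks(dataset: str) -> str:
    scan = Scan()
    scan.set_data_source_name("warehouse")                 # assumed data source
    scan.add_configuration_yaml_file("configuration.yml")  # assumed config file
    scan.add_sodacl_yaml_str(f"""
checks for {dataset}:
  - row_count > 0
  - missing_count(order_id) = 0
""")
    scan.execute()
    if scan.has_check_fails():
        raise Failure(f"Soda checks failed for {dataset}")
    return dataset


@op
def transform_orders(dataset: str) -> None:
    # Placeholder: dbt run / SQL transformation on the validated dataset
    ...


@job
def orders_pipeline():
    # Checks sit between ingestion and transformation; the same op could be
    # reused after transformation for a second gate.
    transform_orders(soda_checks(ingest_orders()))
```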
-
🆕 Data Engineering in Action: Building a Self-Service Data Quality Platform on Amazon Web Services (AWS) with Soda Core
Join data engineer Mohamed Dyab on Wednesday, 11 December, as he shares his insights and lessons learned from building a custom data quality platform at Air Liquide. Discover how the team tackled several critical challenges, including:
* the build-vs-buy dilemma
* translating business requirements into actionable data quality checks (see the sketch below for a flavor of this)
* customizing the user experience for data owners, consumers, and engineers
* streamlining operations without adding complexity
🗓️ Wednesday 11 December 🕘 9am Eastern | 3pm Europe 📍 Online
Building a Self-Service Data Quality Platform on AWS with Soda Core
www.linkedin.com
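To illustrate one of the topics above, translating business requirements into actionable checks, here is a hedged sketch of how a self-service platform might render a rule captured from a data owner as a SodaCL check. The rule, helper function, dataset, and column names are hypothetical examples, not Air Liquide's actual implementation.

```python
# Hypothetical illustration: the business rule "country_code must be one of
# the approved codes" rendered as a SodaCL validity check and run with soda-core.
from soda.scan import Scan


def checks_for_allowed_values(dataset: str, column: str, allowed: list[str]) -> str:
    """Render an allowed-values business rule as SodaCL (names are examples)."""
    values = ", ".join(allowed)
    return (
        f"checks for {dataset}:\n"
        f"  - invalid_count({column}) = 0:\n"
        f"      valid values: [{values}]\n"
    )


scan = Scan()
scan.set_data_source_name("warehouse")                  # assumed data source
scan.add_configuration_yaml_file("configuration.yml")   # assumed connection config
scan.add_sodacl_yaml_str(
    checks_for_allowed_values("orders", "country_code", ["BE", "FR", "DE", "NL"])
)
scan.execute()
scan.assert_no_checks_fail()
```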
-
Join us to see how integrating Soda with Microsoft Azure Data Factory can help you maintain reliable data pipelines and improve the overall quality of your data operations. 🗓️ Thursday 5 December 🕔 11:00 am Eastern | 5:00 pm Europe 📍 Online 👨💼 Tyler Adkins
Key takeaways include how to:
* use Soda to validate and reconcile data across your pipeline to ensure consistency
* add critical data quality checks after ingestion to catch issues early in the pipeline (a minimal sketch follows below)
* create detailed data visualization reports to gain insights into data quality trends and issues
* efficiently review data quality check results and take action to address issues early
How to Test Data Quality in an Azure Data Factory Pipeline
www.linkedin.com
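For a flavor of the post-ingestion validation step mentioned above, here is a minimal, hedged sketch. How it is triggered from Azure Data Factory (for example after a copy activity completes) is not shown, and the data source, dataset, and check thresholds are illustrative assumptions.

```python
# Illustrative post-ingestion validation run after data lands in the warehouse.
# Data source name, dataset, columns, and thresholds are assumptions.
from soda.scan import Scan

scan = Scan()
scan.set_data_source_name("azure_sql")                   # assumed data source name
scan.add_configuration_yaml_file("configuration.yml")    # assumed connection config
scan.add_sodacl_yaml_str("""
checks for stg_orders:
  - row_count > 0
  - missing_count(order_id) = 0
  - freshness(ingested_at) < 1d
""")
scan.execute()
# Surface failures early, before downstream transformations and reports run
scan.assert_no_checks_fail()
```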
-
👋 Hey Data Engineers, we're full throttle on data pipeline testing this month 🪂 and we've got something for every data engineer.
🔹 How to Test Data Quality in Microsoft Azure Data Factory Pipelines
🗓️ December 5, 2024 | 11:00 AM ET | 5:00 PM CET 🖇️ https://lnkd.in/eHV3g2Hk
🔹 How to Build a Self-serve Data Quality Platform on Amazon Web Services (AWS)
🗓️ December 11, 2024 | 9:00 AM ET | 3:00 PM CET 🖇️ https://lnkd.in/ep58vMcX
🔹 How to Test Data Quality in Dagster Labs Pipelines
🗓️ December 12, 2024 | 11:00 AM ET | 5:00 PM CET 🖇️ https://lnkd.in/eyYQc7G5
🔹 Tutorials: Adding Data Quality Checks to Data Pipelines
Explore step-by-step guidance on integrating Soda into Databricks, Apache Airflow, and more, so you can manage reliable pipelines at scale, both during development and before migration, for error-free operations. A minimal Airflow-style sketch follows below. 🖇️ https://lnkd.in/dr3XSzU9
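As a taste of the Apache Airflow integration covered in the tutorials, here is a minimal, hedged sketch of a DAG with a Soda scan task. The DAG id, file paths, schedule, and data source name are illustrative assumptions, not the tutorial's code.

```python
# Illustrative Airflow DAG with a Soda Core scan task gating downstream steps.
# DAG id, file paths, and data source name are assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from soda.scan import Scan


def run_soda_scan() -> None:
    scan = Scan()
    scan.set_data_source_name("warehouse")                           # assumed
    scan.add_configuration_yaml_file("/opt/soda/configuration.yml")  # assumed path
    scan.add_sodacl_yaml_file("/opt/soda/checks.yml")                # assumed path
    scan.execute()
    scan.assert_no_checks_fail()   # fail the task (and the DAG run) on bad data


with DAG(
    dag_id="orders_pipeline_with_soda",
    start_date=datetime(2024, 12, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    soda_scan = PythonOperator(
        task_id="soda_scan",
        python_callable=run_soda_scan,
    )
```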
-
❓Can You Democratize Data Access Without Compromising Security? 🤔 Every organization wants to enable teams with data, but how do you provide access while ensuring security, quality, and accountability? At the heart of data democratization lies a critical question: How can you ensure everyone has the data they need without risking sensitive information or burdening your data team? The answer? Smart access management.
* Prevent accidental overwrites and maintain trustworthy data.
* Increase transparency with tailored access permissions.
* Enable self-service access for business users—so they get what they need, when they need it, without bottlenecks.
With Soda Cloud’s enhanced role-based access controls (RBAC), you can:
✅ Customize roles and permissions to match organizational needs.
✅ Automate user management with Identity Provider (IdP) integration.
✅ Bulk edit access rights for efficiency and accuracy.
👏 Democratize, 👏 don’t compromise. With Soda Cloud, you can have the best of both worlds and foster a more accessible and secure data environment. ➡️ Read our latest blog to learn more: https://lnkd.in/dK9bVeBR
-
❌ Poor data quality can seriously impact business reporting and decision-making, but it doesn't have to. Join Hwai Teck (ARTHUR) Chionh and Vivi (Paraskevi) Belogianni for a 30-minute session on best practices for Data Science and Analytics teams to ensure reliable insights and operational efficiency. In this session you'll learn:
* strategies to empower business users in creating meaningful data quality checks
* ways to embed quality checks within data products for accurate reporting
* how to implement data quality checks in your pipelines
📅 Date: Wednesday November 27 ⏰ Time: 8am Pacific | 11am Eastern | 5pm Europe
Join us on 💻 LinkedIn Live below or register on 📝 Zoom: https://lnkd.in/eQTFKMcJ
How to Detect Bad-Quality Data in Business Reporting
www.linkedin.com
-
Bad data is 💰 expensive, 🤯 frustrating, and 🧰 avoidable. With the right data quality practices, you can ensure your 💹 business reports deliver accurate insights and enable better decisions. Hwai Teck (ARTHUR) Chionh and Vivi (Paraskevi) Belogianni will host a session tomorrow on how to detect and prevent bad-quality data from breaking business reports. 🗓️ Wednesday 27 November 🕚 11:00 am Eastern 🕔 5:00 pm Europe 📍 LinkedIn Live: https://lnkd.in/ejMv8NhA OR Zoom: https://lnkd.in/eQTFKMcJ
In just 30 minutes, you’ll learn how to:
* Spot and resolve data quality issues before they impact decision-making.
* Build stronger internal data products tailored to your reporting needs.
* Engage business users in creating checks that improve the overall health of your data.
-
🕸 Data pipelines can become complex webs, making it tough to manage changes and understand data flow. Data contracts act as an API, providing a clear, formal description of datasets. By bringing in software engineering principles like encapsulation, data contracts help teams build reliable, modular pipelines—no more spaghetti pipelines! Watch Tom talk about how data contracts bring clarity and control.
Too many spaghetti pipelines? No regretti with data contracts as your API.
Data pipelines can resemble spaghetti code: a complex web of scripts hacked together and evolved over time. This complexity makes it increasingly difficult to implement changes and maintain an overview of the data flow. To tackle this challenge at scale, we must identify the components of our data pipelines and establish clear interfaces between them. Persisted datasets, such as tables, serve as the most common interface.
Here’s where data contracts come into play. They provide a formal description of datasets, essentially acting as the API for data. By introducing software engineering principles like encapsulation, data contracts empower teams to avoid the pitfalls of spaghetti pipelines. Instead, we can build distinct data pipeline components where the internal workings are hidden, and users only need to understand the interface: the dataset described by the data contract. This encapsulation not only simplifies interactions but also enhances the reliability of data management.
Data contracts enable you to leverage the power of interfaces within your data components and pipelines. How are you redesigning your data pipelines? Watch this video as I explore what people mean when they talk about the API for data.
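To make the "API for data" idea concrete, here is a hedged sketch of contract-style expectations on a published dataset, expressed as SodaCL schema and quality checks and enforced with soda-core. It is a generic illustration rather than Soda's exact data contract format, and all dataset and column names are assumptions.

```python
# Generic illustration (not Soda's exact data contract format): the "interface"
# of a published dataset captured as schema + quality expectations and enforced
# before consumers read it. Dataset and column names are assumptions.
from soda.scan import Scan

CONTRACT_CHECKS = """
checks for dim_customer:
  - schema:
      fail:
        when required column missing: [customer_id, email, country_code]
        when wrong column type:
          customer_id: integer
  - duplicate_count(customer_id) = 0
  - missing_count(email) = 0
"""

scan = Scan()
scan.set_data_source_name("warehouse")                  # assumed data source
scan.add_configuration_yaml_file("configuration.yml")   # assumed connection config
scan.add_sodacl_yaml_str(CONTRACT_CHECKS)
scan.execute()
scan.assert_no_checks_fail()   # the dataset only "ships" if it honors its interface
```

Consumers only need to know these expectations; how the producing pipeline computes the dataset stays encapsulated behind the interface.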
-
To YAML, or not to YAML? That is the question. YAML often sparks debate, but we at Soda believe it's the best choice for enabling collaboration in data testing and contracts. Here's why:
1️⃣ Accessibility: YAML makes it easy for less technical users to tweak configurations with the right tooling. Combine this with a no-code UI like Soda’s, and non-technical stakeholders can author checks, test them, and propose updates to producers. 🤝
2️⃣ Collaboration: With YAML files integrated into Git workflows, data engineers get clean, actionable PRs instead of endless email threads. This fosters faster iteration and alignment between data producers and consumers. 🏭
3️⃣ Inclusivity: Data testing shouldn't rest solely on the shoulders of producers. YAML opens the door for business stakeholders with domain knowledge to contribute, building a robust picture of what "good data" looks like. 🖼
👇 As Tom shares in his post below, data testing is the backbone of reliable pipelines. Observability is critical, but without proactive testing, you're always chasing problems after the fact. After all, as that old saying goes: garbage in, garbage out. 🚮
Why is YAML the best syntax for data testing and contracts? Despite YAML having some drawbacks, we are convinced it is by far the best language for data testing.
If you've heard of Domain-Specific Languages (DSLs, https://lnkd.in/ei2BdJhv ) before, you may know that they can be embedded in different syntaxes: you can create a completely new syntax (like LookML, for example), or you can embed a DSL in a programming language, YAML, JSON, or even XML back in the days. I was triggered to write down our thoughts on this because of this reddit post comparing GE with Soda: https://lnkd.in/ehT6PTtP
Many of the less technical people in an organization cannot put Python code in production, but they can tweak YAML files if they get the right tooling. Our no-code UI enables less technical people to author checks in a web UI, test them, and propose them to the producers for inclusion in the test suite in the production data pipeline. As a data engineer producing data, you don't want endless email conversations about your consumers' testing requirements. Instead, you want easy-to-merge PRs on your git repo with the extra checks requested by others in the organization.
Taking a step back, both producers and consumers must collaborate to build up a picture of what good data looks like. Only if we enable everyone to contribute will we be able to build proper unit testing. Producers are sometimes missing, and other people in the business often have to contribute data knowledge. Data testing takes time to build, just like any software test suite.
I am glad to see that in the reddit conversation everyone already seems aware that data testing is crucial as the preventative aspect of data quality. Observability alone doesn't cut it: fully automated observability that helps diagnose issues after the fact will not be sufficient to get reliable data. The more people can get involved in data testing, the more issues will be caught before they reach production. That is the true value of data testing, and YAML helps in the collaboration between producers and anyone else in the business with data domain knowledge.
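As a hedged sketch of the Git-based collaboration described above, the snippet below shows a production pipeline loading both the producer's checks file and a consumer-proposed checks file that was merged via PR. The file paths, data source name, and check contents shown in the comments are illustrative assumptions.

```python
# Illustrative sketch: producer- and consumer-authored SodaCL YAML files,
# versioned in the same Git repo, executed together in the production pipeline.
# File paths, data source name, and check contents are assumptions.
from soda.scan import Scan

scan = Scan()
scan.set_data_source_name("warehouse")                  # assumed data source
scan.add_configuration_yaml_file("configuration.yml")   # assumed connection config

# checks/orders_producer.yml -- owned by the data engineering team, e.g.:
#   checks for orders:
#     - row_count > 0
#     - missing_count(order_id) = 0
scan.add_sodacl_yaml_file("checks/orders_producer.yml")

# checks/orders_consumer.yml -- proposed by a business stakeholder via a PR, e.g.:
#   checks for orders:
#     - invalid_count(country_code) = 0:
#         valid values: [BE, FR, DE, NL]
scan.add_sodacl_yaml_file("checks/orders_consumer.yml")

scan.execute()
scan.assert_no_checks_fail()
```

Because both files are plain YAML in the same repo, the consumer's contribution arrives as an easy-to-review PR rather than an email thread, and it runs in exactly the same scan as the producer's checks.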