Is Data Science Failing to Deliver Value? Here is a simple explanation

Data scientists love to have control over their projects. They want to choose the best modeling approach for the problem at hand, and because feature engineering matters for so many models, they want to own the model inputs and the feature engineering logic. In many cases they are also eager to own their models in production, because that lets them troubleshoot and improve the models quickly. What they don't have strong opinions about is the data warehouse, the compute platform, or the workflow scheduler: they just want those to work well and to produce error messages that are clear and easy to understand.

Machine Learning (ML) models are statistical artifacts created by training on data. They are not based on deterministic rules, and they can make very effective inferences from complex data that would be impractical or impossible to encode as explicit rules. However, their ability to make accurate inferences depends on the data they are fed in production matching the conditions under which the training data was collected. As the world changes, and the "operating regime" reflected in the data presented to the model in production diverges from what was present in the training data, the accuracy of the model's predictions will degrade.
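
As a concrete, deliberately simplified illustration of how that divergence can be caught early, one common approach is to compare the distribution of each input feature in recent production traffic against the distribution seen at training time, for example with a two-sample Kolmogorov-Smirnov test. The feature names, threshold, and data frames below are illustrative assumptions, not a prescription:

```python
from scipy.stats import ks_2samp


def detect_feature_drift(train_df, prod_df, features, p_threshold=0.01):
    """Flag features whose production distribution has drifted from training.

    Runs a two-sample Kolmogorov-Smirnov test per numeric feature; a small
    p-value suggests the production sample no longer matches the training
    sample. train_df and prod_df are assumed to be pandas DataFrames that
    share the listed columns.
    """
    drifted = {}
    for feature in features:
        result = ks_2samp(train_df[feature].dropna(), prod_df[feature].dropna())
        if result.pvalue < p_threshold:
            drifted[feature] = {"ks_statistic": result.statistic,
                                "p_value": result.pvalue}
    return drifted


# Hypothetical usage, e.g. on an hourly schedule:
# alerts = detect_feature_drift(train_snapshot, last_hour_inputs,
#                               ["age", "txn_amount", "session_length"])
# if alerts:
#     page_the_owning_team(alerts)  # placeholder alerting hook
```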

Even under normal conditions, each model has a natural cadence at which it needs to be refreshed (retrained) to maintain its desired effectiveness. That cadence can range from months or weeks down to a day or less. In extreme situations, such as the Coronavirus pandemic, ML models can lose their predictive efficacy very rapidly.
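
Rather than guessing that cadence, teams often measure it: once ground-truth labels arrive, track a live performance metric over a rolling window and flag when it drops below an agreed floor. The sketch below assumes a binary classifier scored with AUC; the metric, window size, and floor are illustrative choices only:

```python
from collections import deque

from sklearn.metrics import roc_auc_score


class RetrainTrigger:
    """Rolling performance monitor that signals when a model refresh looks due."""

    def __init__(self, auc_floor=0.75, window=5000):
        self.auc_floor = auc_floor
        self.buffer = deque(maxlen=window)  # most recent (label, score) pairs

    def record(self, y_true, y_score):
        """Call whenever a prediction receives its ground-truth label."""
        self.buffer.append((y_true, y_score))

    def retrain_due(self, min_samples=500):
        if len(self.buffer) < min_samples:
            return False  # not enough labeled feedback yet
        labels, scores = zip(*self.buffer)
        if len(set(labels)) < 2:
            return False  # AUC is undefined with only one class present
        return roc_auc_score(labels, scores) < self.auc_floor
```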

Furthermore, ML models have very exacting technical requirements. If the production environment departs even in small ways from the environment used in development, the model may not operate properly. This highlights the importance of proper testing and validation of ML models before deployment.
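
A lightweight guard against such environment mismatches is a pre-deployment smoke test: load the packaged model inside the production image and check its predictions on a small, frozen reference batch against outputs captured in development. This is only a sketch; the file paths, file format, tolerance, and the scikit-learn-style `predict` interface are assumptions:

```python
import json

import joblib
import numpy as np


def smoke_test_model(model_path="model.joblib",
                     reference_path="reference_batch.json",
                     tolerance=1e-6):
    """Verify the deployed artifact reproduces predictions captured in dev.

    reference_batch.json is assumed to hold {"inputs": [...], "expected": [...]}
    recorded when the model was validated in the development environment.
    """
    model = joblib.load(model_path)
    with open(reference_path) as f:
        reference = json.load(f)

    inputs = np.asarray(reference["inputs"])
    expected = np.asarray(reference["expected"])
    actual = model.predict(inputs)

    if not np.allclose(actual, expected, atol=tolerance):
        raise RuntimeError("Prediction mismatch: environment differs from dev")
    return True
```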

The life cycle of ML models also requires participation from a large number of stakeholders, including the line of business (which is typically the sponsor for the model), the data science team that develops the model, the DevOps team that integrates the model into production applications, the DataOps team that supplies the data pipelines, the ITOps team that operates the production infrastructure, and the governance organization that ensures compliance with internal and external regulations.

In large complex enterprises with multiple business units, there can be hundreds or thousands of models, each with a unique life cycle in terms of its business KPIs, development platform, training and retraining criteria, production environment, reporting and alerting thresholds, approvals regime and compliance requirements. This emphasizes the need for a streamlined and efficient model management process to ensure the successful deployment and maintenance of ML models.

Tools such as Jupyter Notebooks can be highly beneficial, particularly in educational settings and when exploring candidate solutions to mathematical problems. However, like all rapid application development tools, they tend to trade away other important attributes such as maintainability, testability, and scalability.
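
One common remedy is to lift the core logic out of the notebook into an importable module that the notebook simply calls, so the same code can be unit-tested, reviewed, and versioned like any other software asset. The feature function and pytest-style test below are a hypothetical sketch of that pattern (the column names and session semantics are made up for illustration):

```python
# features.py -- extracted from the exploratory notebook so it can be tested
import pandas as pd


def add_session_features(events: pd.DataFrame) -> pd.DataFrame:
    """Derive per-session aggregates from a raw event log.

    Assumes columns: session_id, event_ts (datetime), amount (float).
    """
    grouped = events.groupby("session_id")
    return pd.DataFrame({
        "n_events": grouped.size(),
        "total_amount": grouped["amount"].sum(),
        "duration_s": (grouped["event_ts"].max()
                       - grouped["event_ts"].min()).dt.total_seconds(),
    }).reset_index()


# test_features.py -- runs under pytest, outside any notebook
def test_add_session_features():
    events = pd.DataFrame({
        "session_id": ["a", "a", "b"],
        "event_ts": pd.to_datetime(["2024-01-01 00:00:00",
                                    "2024-01-01 00:00:30",
                                    "2024-01-01 01:00:00"]),
        "amount": [1.0, 2.0, 5.0],
    })
    out = add_session_features(events).set_index("session_id")
    assert out.loc["a", "n_events"] == 2
    assert out.loc["a", "total_amount"] == 3.0
    assert out.loc["a", "duration_s"] == 30.0
```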

Much of the research and development in the field of Machine Learning and Artificial Intelligence has been driven by Data Science teams, rather than Computer Science teams. While this specialization has allowed for significant advancements in the field, it also means that a significant proportion of ML practitioners may not have been exposed to the lessons and best practices of managing software assets in commercial environments over the past seventy years.

This can result in significant conceptual gaps between what is involved in creating a proof of concept of a trained ML model on a Data Scientist's laptop, and what is required to safely transition that model into a commercial product in production environments. It is not unfair to say that the current state of MLOps is still on an early path towards maturity, and that much of the early challenge for adoption will be one of education and communication, rather than technical refinements of tools alone.

To achieve real value from ML projects, healthy companies should:

  1. Make data producers accountable for the delivery, quality, and timeliness of their data. The team responsible for a search page should also be responsible for ingesting the data generated by that page, with support from the Data Platform team (a sketch of such a data contract check follows this list).
  2. Involve SRE teams in the implementation of challenging systems, such as GPU workflows or Spark infra for ML jobs, rather than just offering best practice recommendations. SRE teams should be tied to the delivery incentives of the teams they serve.
  3. Allow ML engineers autonomy and minimize bureaucratic tasks to maximize their effectiveness in model training and optimization. Allocate other team members to maintenance, on-call responsibilities, and compliance tasks to allow ML engineers to focus on their comparative advantage.
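
On point 1, accountability is easiest to enforce when the producing team ships an automated contract check alongside its data. The column names, table semantics, and freshness window in this sketch are purely illustrative assumptions:

```python
from datetime import datetime, timedelta, timezone

import pandas as pd

EXPECTED_COLUMNS = {"query_id", "user_id", "results_shown", "clicked_rank", "event_ts"}


def check_search_events(batch: pd.DataFrame, max_lag=timedelta(hours=2)):
    """Minimal delivery/quality/timeliness check a producing team could own.

    Returns a list of human-readable problems; an empty list means the batch
    meets the (illustrative) contract.
    """
    problems = []
    missing = EXPECTED_COLUMNS - set(batch.columns)
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
        return problems  # further checks need these columns

    if batch.empty:
        problems.append("batch is empty")
        return problems

    null_ids = batch["query_id"].isna().sum()
    if null_ids:
        problems.append(f"{null_ids} rows with null query_id")

    newest = pd.to_datetime(batch["event_ts"], utc=True).max()
    if datetime.now(timezone.utc) - newest > max_lag:
        problems.append(f"data is stale: newest event at {newest}")

    return problems
```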

To implement these culture changes, it is important to resolve internal engineering politics and to prioritize getting value from ML over fairness in task distribution. Otherwise, a broken culture upstream of the day-to-day dev workflow will lead to wasted money and high turnover on the ML teams.

More such nuggets in the Mlopsweekly newsletter #mlops #datascience #machinelearning


