As we enter the new decade, one thing is clear: the explosive growth of data science and AI has made applying them effectively a critical differentiator for any enterprise. Despite near-universal acknowledgement of this and the major investments being made, many enterprises struggle to deliver sustained value from their data science initiatives.
Gartner’s Predicts 2019 report on AI predicted that 85% of use cases leveraging AI techniques will fail. While some failure is inevitable in data science (that’s why we call it science), that’s a staggeringly high number. Even companies that have succeeded with some of their data science initiatives struggle to replicate that success and scale it across their enterprise.
Until recently, data science projects have pretty much been the Wild West. Data science teams everywhere have been cobbling together various workflows and tools in an attempt to deliver value consistently. Within a single organization, teams may have wildly different processes, leading to inconsistency in quality, robustness, transparency, and speed.
Many organizations are currently leveraging Machine Learning as a component of their data science initiatives. We are seeing the proliferation of Machine Learning Operations (MLOps), which enables teams to successfully implement, deploy, and monitor their models in production. While MLOps is a critical part of a successful enterprise data strategy, on its own it provides no guidance for the steps that come before building and training a model: namely feature creation, data acquisition, exploration, and asking the right questions. To expand the machine learning lifecycle into the more holistic data science lifecycle, we need to add workflows for three critical pieces:
- Asking and refining questions
- Exploring and publishing findings
- Collaborating on experiments
The process today
Across the data science space, we see teams hitting the same challenges, such as:
- High cycle times
- Solving the wrong problems
- Failure to put results in production
- Lack of collaboration
- Lack of transparency
- Lack of reproducibility
- Key-person risk
- Duplicated efforts
- Fragile deployments
- Hidden tech debt from ML
And that’s not counting the other headaches.
If teams aren’t solving the right problems and asking the right questions, it doesn’t matter how good their technical execution is. In most organizations, defining the problem is incredibly informal, occurring over a combination of email, chat, meetings, and random documents that float around the organization (often with multiple versions circulating at the same time). Even for those who are using a project planning tool, there are often major discrepancies between what a ticket says and what the data scientists actually end up doing. To address this challenge, we need to bring everyone to the table (data scientists, engineers, IT, and the business) to collaborate on what problems should be solved and how the team will approach solving them.
Another challenge we set out to solve is what teams should do with data exploration and analysis. Often, this is where the process really breaks down. Until now, the options have been to have everyone keep their exploration to themselves or to publish everything. With the first option, we lose a lot of knowledge and end up duplicating work. With the second, our repo gets flooded with notebooks and the noise drowns out the value. Our goal is to keep that information but have it show up only when it’s contextually relevant.
We used the lessons learned from our experience as data scientists to build a simple, effective, and lightweight data science lifecycle process. Our goal was to give our enterprise customers a framework that would enable them to accelerate their data science initiatives. The process needed to be easy to learn, easy to adopt, and easy to modify to fit the needs of every business. Unlike some of the other approaches we’ve seen, this process makes minimal assumptions about team structure or organizational patterns.
In practice, we created a series of Issue and Pull Request templates, a common set of labels, and a coherent branching strategy. We can leverage these to keep everyone on the same page and enable “opt-in” workflows. By using Issues and Pull Requests, our customers not only track code but also retain a full history of all questions, experiments, explorations, and solutions. This creates more transparency and enables greater collaboration across the organization.
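To make the idea of shared labels and a branching convention a little more concrete, here is a minimal Python sketch of how an issue might map to a label and a branch name. The issue types, label names, and branch format below are hypothetical, chosen only for illustration; they are not the actual templates or conventions described in this post.

```python
import re

# Hypothetical issue types and their labels (illustrative only; the real
# label set is not listed in this post).
ISSUE_LABELS = {
    "ask": "question",
    "explore": "exploration",
    "experiment": "experiment",
}


def branch_name(issue_type: str, issue_number: int, title: str) -> str:
    """Build a branch name of the form '<type>/<number>-<slugified-title>'."""
    if issue_type not in ISSUE_LABELS:
        raise ValueError(f"unknown issue type: {issue_type!r}")
    # Lowercase the title and collapse every run of non-alphanumeric
    # characters into a single hyphen, trimming hyphens at the ends.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{issue_type}/{issue_number}-{slug}"


if __name__ == "__main__":
    print(ISSUE_LABELS["explore"], branch_name("explore", 42, "Churn by Region?"))
```

Tying the branch name to the issue number is one way to keep the question, the exploration, and the resulting Pull Request linked, so the history stays traceable without anyone having to publish every notebook.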
When we combined this templated process with MLOps and DataOps, we ended up with a powerful end-to-end process that makes the data science lifecycle enterprise-ready and empowers everyone to bring more value. Because we’re clever, we’ve named this The Data Science Lifecycle Process (DSLP).
Going forward, we’re looking to help more of our customers scale their data science efforts. As we continue to develop the DSLP, we’ll be sharing more. For now, we’re working with select customers in private preview to deliver workshops that teach this new process. If you’re interested in engaging with us, please email firstname.lastname@example.org.