Unused Models
Why you should be measuring data adoption and data retentionThis is the third in a series of blog posts about challenges teams may face when modernizing their data stack and ETL pipelines. We'll focus especially on situations where, despite every effort being made to produce high-quality, reliable data, somehow stakeholders end up receiving lower-quality or unreliable data. See also:
Introduction
Much has been written on unused models, and while there's no confusion on what they are and some of the downsides, there's a lot deeper impact beyond just the extra compute and storage used by excess data. To be extra clear though, when we say "unused models" we're referring to kimball / dimensional models, not machine learning or AI models. For the purposes of this article, wherever you see "unused model" you could substitute that for "unused SQL table".
In this post, we'll talk about all the problems that come from unused models, why they occur, why they are at their core a signal of a product miss, and how to detect and prevent them.
Impact of unused models
There's a number of reasons you might want to get a handle on unused models in your data warehouse.
-
Extra compute and storage: Unused models consume compute and storage resources that can quickly add up on your data warehouse bill. If your modeling tool requires its own compute resources, you're paying for that too.
-
Maintainance Overhead: All models require maintenance.
- Transformations may break when upstream data sources change
- Freshness, volume, or anomaly alerts may need to be tuned
- Tests need to be maintained
- Contracts need to be evolved over time
- Documentation and catalogs need to be updated
Unused models still require this maintenance but don't provide any value in return.
-
Slower Dev Cycles: In general, every model can increase the time it takes to run your pipeline end to end. This means slower cycles when developing changes locally. Research has shown that slowing down a developer's inner loop by even a few seconds can have a huge impact on productivity. Even if it's just a few extra models here and there, that can add up across hundreds of iterations from tens of engineers.
-
Alerts: Every data engineer I know has, at least one time, been paged at 2am to fix a job that feeds an "important" report or model, only to find out two days later at the retro that this model hasn't been used in months.
-
Security and Compliance: The fewer models you have, the fewer models you have to protect. Every table you maintain is a liability of some scale. Don't carry risk you don't need to.
-
Congitive Overhead: In addition to task-based maintenance overhead, there's a general cognitive overhead that goes with the ever-increasing volume and complexity of models under management. If you don't believe me, pull up the (very cool and sophisticated) lineage graph for GitLab's churn model, wait a full 10+ seconds for it to load, and then feel that delightful rush of anxiety as you try to imagine being responsible for keeping it high-quality and highly available.
Why Production Models Go Unused
There are a number of reasons why a model might go unused. Here are a few of the most common:
-
Lack of Awareness: Sometimes, stakeholders or analysts don't know that a model exists, or that it's available to them.
-
Missing Rows: Sometimes, a model is missing records that are needed for a particular use case. For example, a sales model that excludes cancelled orders might be great for forecasting, but not for calculating churn.
-
Missing Columns: Sometimes, a model is missing columns that are needed for a particular use case. For example, a new field may have been added upstream in a product database, but downstream models haven't been updated to include it.
-
No Longer Needed: Sometimes, a model is no longer needed. For example, a model that was used to calculate a metric that is no longer relevant, or a one-time dataset used for training a machine-learning experiment.
Detecting Unused models
There's a few ways you can go about trying to detect and respond to unused models.
-
Query model Graphs: many tools offer metadata catalogs that enable you to inspect how different tables and models are connected. The team at Ramp did a guest post on the select.dev blog that walks through one approach to this with DBT.
-
Explore Observability tools: Many observability tools use sophisticated lineage approaches to understand how data flows through your system. For example, BigEye can be used to understand how data flows through your system, and Datafold can be used to understand how data changes over time.
-
Track Product Metrics: Use a tool like Metalytics to understand standard product metrics like activation, adoption, and retention for your data assets.
-
Talk to your users: Find out which models are most important. Ask people what they liked about things you've shipped recently. Find out what they're not using or that they don't know about.
Remediation and Prevention
As usual, if you've made it this far, you're ready to start tackling unused models in your warehouse. If you discover an unused model, you have three options:
-
Educate your team: Ensure you have good documentation for production assets. If that's not enough, host regular office hours or training sessions to help people get up to speed.
-
Adjust models so they are useful: If the problem is that the model is missing rows or columns, or is just hard to use, adjust it so that it is valuable and usable.
-
Deprecate and remove: If you verify that the model is truly unneeded, it's time to thank it for its service and free up that storage and compute for what matters.
Overall, consider preventing the problem before it gets out of hand
-
Definition of Done: Make product metrics like activation, adoption, and retention part of your "definition of done" for data assets. Don't move on to the next data project until stakeholders are onboarded into an asset, and, where applicable, have integrated it into their recurring workflows.
-
Empower your Team: If you can measure it, you can improve it. Give your team the tools they need to understand how their work is being used, and empower them to make decisions about what to prioritize and what to deprecate.
-
Talk to your users: I'll say it at least twice in every post. All the tech in the world won't make up for lack of context on what stakeholders and analysts need and how they'll use it. If your primary communication with stakeholders is 1-2 comments on a ticket, you're missing invaluable information about your users.
That's it. No soapbox this time. Go get a machete and start hacking away at those unused models. With any luck, you'll make a commit like this one soon: