Active metadata is the latest category from Gartner, and it’s a transformational leap from today’s augmented data catalogs.
Metadata management just got shaken up: Gartner scrapped its Magic Quadrant for Metadata Management Solutions and replaced it with the Market Guide for Active Metadata Management. See the difference? With that change, Gartner just introduced active metadata as a new category for the future.
As with any new category in the data ecosystem, this announcement comes with a ton of excitement, some healthy skepticism, and loads of questions.
- What exactly is active metadata?
- How is it different from augmented data catalogs and other technologies we’ve seen before?
- What does an active metadata platform look like?
I’ve previously written about what an active metadata platform is and its key characteristics. Today, I want to go a step beyond that abstract discussion: paint a picture of what an active metadata platform could look like, break down its key components, and share some real-life use cases for active metadata.
TL;DR: What does an active metadata platform look like?
In my mind, an active metadata platform has 5 key components:
- The metadata lake: A unified repository to store all kinds of metadata, in raw and processed forms, built on open APIs and powered by a knowledge graph.
- Programmable-intelligence bots: A framework that allows teams to create customizable ML or data science algorithms to drive intelligence.
- Embedded collaboration plugins: A set of integrations, unified by the common metadata layer, that seamlessly integrate data tools with each data team’s daily workflow.
- Data process automation: An easy way to build, deploy, and manage workflow automation bots that will emulate human decision-making processes to manage a data ecosystem.
- Reverse metadata: Orchestration to make relevant metadata available to the end user, wherever and whenever they need it, rather than in a standalone catalog.
1. The metadata lake: A single central store for metadata
A few quarters ago, I wrote about the concept of a metadata lake: a unified repository to store all kinds of metadata, in raw and further processed forms, which can be used to drive both the use cases we know of today and those of tomorrow.
Active metadata is built on the premise of actively finding, enriching, inventorying, and using all of this metadata, taking a traditionally “passive” technology and making it truly action-oriented.
The cornerstone of any active metadata platform, the metadata lake has two key characteristics:
- Open APIs and interfaces: The metadata lake needs to be easily accessible, not just as a data store but via open APIs. This makes it incredibly easy to draw on a single store of metadata at every stage of the modern data stack to drive a variety of use cases, such as discovery, observability, and lineage.
- Powered by a knowledge graph: Metadata’s true potential is unlocked when all the connections between data assets come alive. The knowledge graph architecture, which powers some of the world’s largest internet companies (like Google, Facebook, and Uber), is the most promising candidate for bringing these connections to life. (A minimal sketch follows below.)
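To make that concrete, here’s a minimal sketch of metadata modeled as a knowledge graph. I’m using networkx purely as an illustration, and the asset names are hypothetical; the point is that questions like “what breaks downstream if this table changes?” become simple graph traversals.

```python
# A minimal sketch: metadata as a knowledge graph (networkx is an
# illustrative choice; all asset names are hypothetical).
import networkx as nx

graph = nx.DiGraph()

# Nodes are data assets; node attributes hold metadata about each asset.
graph.add_node("raw.orders", type="table", owner="data-eng")
graph.add_node("mart.revenue", type="table", owner="analytics")
graph.add_node("looker.revenue_dashboard", type="dashboard", owner="finance")

# Edges capture relationships between assets, such as lineage.
graph.add_edge("raw.orders", "mart.revenue", relationship="feeds")
graph.add_edge("mart.revenue", "looker.revenue_dashboard", relationship="feeds")

# "What breaks downstream if raw.orders changes?" is a one-line traversal.
print(nx.descendants(graph, "raw.orders"))
# {'mart.revenue', 'looker.revenue_dashboard'}
```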
2. Programmable-intelligence bots
We’re fast approaching a world where metadata itself is becoming big data, and making sense of this metadata is key to creating modern data management ecosystems.
Metadata intelligence has the potential to impact every aspect of the data lifecycle. It could parse SQL query logs to automatically create column-level lineage. It could auto-identify PII (personally identifiable information) to protect private data. It could catch bad data before it catches us, by automatically detecting outliers and anomalies. Metadata has seen some innovation on this front in the past few years, and “augmented” data catalogs have become more and more popular.
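As a flavor of what that parsing could look like, here’s a simplified sketch that pulls the tables referenced by each logged query. I’m using the open-source sqlglot parser as an illustration; true column-level lineage means resolving every output column back to its sources, which takes considerably more machinery.

```python
# A simplified sketch: extract the tables each logged query touches.
# (sqlglot is an illustrative choice; real column-level lineage requires
# resolving every output column to its source columns.)
import sqlglot
from sqlglot import exp

def referenced_tables(query: str) -> set[str]:
    """Return the qualified names of all tables a query reads or writes."""
    return {
        ".".join(part for part in (table.db, table.name) if part)
        for table in sqlglot.parse_one(query).find_all(exp.Table)
    }

query_log = ["INSERT INTO mart.revenue SELECT o.amount FROM raw.orders AS o"]
for query in query_log:
    print(referenced_tables(query))
# {'mart.revenue', 'raw.orders'}
```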
However, in all the hype, I believe there’s one thing that we’ve gotten wrong so far about how intelligence would apply to data management — one size doesn’t fit all.
Every company is unique. Every industry is unique. Every individual team’s data is unique.
On a recent call, a data leader criticized the tool he uses to detect data quality anomalies: “Sometimes the tool sends us useful alerts about schema changes and quality issues. Other times, it screams about stuff that it shouldn’t be screaming about and really frustrates our data engineering team.”
I don’t blame the tool. The reality is that every machine learning algorithm’s output is a function of the training data that goes in. No one algorithm will magically create context, identify anomalies, and achieve the intelligent data management dream — and succeed 100% of the time for every industry, every company, and every use case. As much as I wish there were, there’s no silver bullet.
This is why I believe that the future of intelligence in active metadata platforms is not a single algorithm that magically solves all our problems. Rather, it’s a framework that allows teams to create programmable-intelligence bots that can easily be customized to different contexts and use cases.
Here are a few examples of programmable-intelligence bots:
- As security and compliance requirements go mainstream, companies will have to follow more rules, whether industry-specific ones like HIPAA for healthcare data and BCBS 239 for banking, or locale-specific ones like GDPR in Europe and CCPA in California. Bots could be used to identify and tag sensitive columns based on the regulations that apply to each company. (A minimal sketch of this follows the list.)
- Companies that have specific naming conventions for their datasets could create bots to automatically organize, classify, and tag their data ecosystem based on preset rules.
- Companies could take out-of-the-box observability and data quality algorithms, and customize them to their data ecosystems and use cases.
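To make the first example concrete, here’s a minimal sketch of what a programmable-intelligence bot could look like: a tiny bot interface, plus a PII-tagging bot whose rules each team swaps out for the regulations that apply to them. All names and patterns here are hypothetical.

```python
# A minimal sketch of a programmable-intelligence bot: teams supply their
# own rules, the platform supplies the metadata. All names and patterns
# below are hypothetical.
import re
from typing import Protocol

class MetadataBot(Protocol):
    def run(self, columns: list[str]) -> dict[str, list[str]]: ...

class PiiTaggingBot:
    """Tags columns whose names match team-defined PII patterns."""

    def __init__(self, patterns: dict[str, str]):
        # tag -> regex over column names; each team customizes this map.
        self.patterns = {tag: re.compile(rx, re.I) for tag, rx in patterns.items()}

    def run(self, columns: list[str]) -> dict[str, list[str]]:
        return {
            col: [tag for tag, rx in self.patterns.items() if rx.search(col)]
            for col in columns
        }

# A GDPR-flavored configuration (illustrative, not a compliance rule set).
bot = PiiTaggingBot({"pii.email": r"e.?mail", "pii.phone": r"phone|mobile"})
print(bot.run(["user_email", "phone_number", "order_total"]))
# {'user_email': ['pii.email'], 'phone_number': ['pii.phone'], 'order_total': []}
```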
The use cases for programmable intelligence are endless, and I’m extremely excited about what the future holds!
3. Embedded collaboration plugins
Today, data teams are more diverse than ever. They’re made up of data engineers, analysts, analytics engineers, data scientists, product managers, business analysts, citizen data scientists, and more.
These diverse data teams use equally diverse data tools, everything from SQL, Looker, and Jupyter to Python, Tableau, dbt, and R. Add a ton of collaboration tools (like Slack, JIRA, and email), and you’ve made the life of a data professional a nightmare.
Because of the fundamental diversity in data teams, data tools need to be designed to integrate seamlessly with each team’s daily workflow.
This is where the idea of embedded collaboration comes alive. Instead of jumping from tool to tool, embedded collaboration is about work happening wherever each data team member lives, with less friction and less context-switching.
Here are a few examples of what embedded collaboration could look like:
- What if you could request access to a data asset when you get a link, just like with Google Docs, and the owner could get the request on Slack and approve or reject it right there?
- What if, when you’re inspecting a data asset and need to report an issue, you could trigger a support request that’s perfectly integrated with your engineering team’s JIRA workflow?
The action layer in active metadata platforms is what will make embedded collaboration finally come alive. I see this layer as a Zapier for the modern data stack — unified by the common metadata layer, and allowing teams to customize apps for their own unique workflows.
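As a sketch of that action layer, take the access-request example above: the platform emits an event, and a small plugin routes it to Slack for one-click approval. Slack’s chat_postMessage API is real; the event shape and the metadata-platform hook are my assumptions.

```python
# A sketch of an embedded-collaboration plugin: route a data-access
# request to Slack for one-click approval. The event shape is
# hypothetical; slack_sdk's chat_postMessage is a real API.
import os
from slack_sdk import WebClient

client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

def on_access_request(event: dict) -> None:
    """Called by the (hypothetical) metadata platform when someone
    requests access to an asset; pings the asset owner on Slack."""
    client.chat_postMessage(
        channel=event["owner_slack_id"],
        text=f"{event['requester']} requested access to {event['asset']}",
        blocks=[
            {"type": "section",
             "text": {"type": "mrkdwn",
                      "text": f"*{event['requester']}* requested access to `{event['asset']}`"}},
            {"type": "actions",
             "elements": [
                 {"type": "button", "text": {"type": "plain_text", "text": "Approve"},
                  "style": "primary", "action_id": "approve_access"},
                 {"type": "button", "text": {"type": "plain_text", "text": "Reject"},
                  "style": "danger", "action_id": "reject_access"},
             ]},
        ],
    )
```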
4. Data process automation
A few years ago, a new category of tooling called Robotic Process Automation (RPA) took the enterprise world by storm. As UiPath puts it, RPA is “a software technology that makes it easy to build, deploy, and manage software robots that emulate human actions interacting with digital systems and software”.
As concepts like data fabrics, data meshes, and DataOps become mainstream in the way we think about data platforms, they’ll give rise to the need for Data Process Automation (DPA) — an easy way to build, deploy, and manage workflow automation bots that will emulate human decision-making processes or actions to manage your data ecosystem.
Have you ever been frustrated by dashboard load speeds on a Monday morning? Or worse, surprised by a crazy-high AWS bill at the end of the month?
With active metadata platforms, it isn’t hard to imagine a world where neither would happen again. A true active metadata platform could recommend parameterized instructions to adjacent data management tools for operations such as resource allocation and job management.
For example, by leveraging metadata from a variety of sources — such as the top BI dashboards along with time of peak usage from the BI tool, past data pipeline run stats from the data pipeline tool, and past compute performance from the warehouse — you can imagine a world where the active metadata platform doesn’t just recommend parameters for scaling up a Snowflake warehouse, but actually leverages DPA to allocate warehouse resources.
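Here’s a rough sketch of what that could look like. The peak-usage lookup stands in for a real metadata-lake query (hypothetical), while the warehouse resize uses Snowflake’s actual Python connector and ALTER WAREHOUSE statement.

```python
# A rough sketch of data process automation: pre-scale a Snowflake
# warehouse before peak dashboard hours. get_peak_usage_hours() is a
# hypothetical metadata-lake call; ALTER WAREHOUSE is real Snowflake SQL.
import datetime
import os
import snowflake.connector

def get_peak_usage_hours(warehouse: str) -> set[int]:
    """Hypothetical: ask the metadata lake for this warehouse's historical
    peak-usage hours, derived from BI and pipeline metadata."""
    return {9, 10, 11}  # stand-in for a real metadata query

def rightsize(warehouse: str) -> None:
    now_hour = datetime.datetime.now().hour
    size = "LARGE" if now_hour in get_peak_usage_hours(warehouse) else "XSMALL"
    conn = snowflake.connector.connect(
        user=os.environ["SF_USER"],
        password=os.environ["SF_PASSWORD"],
        account=os.environ["SF_ACCOUNT"],
    )
    try:
        # Identifiers can't be bound as parameters; warehouse is trusted config here.
        conn.cursor().execute(f"ALTER WAREHOUSE {warehouse} SET WAREHOUSE_SIZE = {size}")
    finally:
        conn.close()

# Run this on a schedule (cron, Airflow, etc.):
rightsize("ANALYTICS_WH")
```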
5. Reverse metadata
I believe that one of the greatest things about the last few years is the rise of truly “modern data stack” companies and entrepreneurs who believe that an amazing user experience trumps everything else.
While the old era was all about “value capture”, the new breed of entrepreneurs are focused on “value creation” — with the end user experience coming first. Modern data stack companies are increasingly interested in genuinely partnering with one another to integrate their product roadmaps and create a better user experience.
Active metadata holds the key to truly unlocking these partnerships, and this is where I think “reverse metadata” will change the game.
Reverse metadata is about metadata no longer living only in a standalone data catalog. Instead, it’s about making relevant metadata available to end users, wherever and whenever they need it, to help them do their jobs better.
For example, at Atlan, our reverse metadata integration with Looker shows “context” (like who owns a dashboard, metrics definitions and documentation, and more) directly within Looker.
Active metadata platforms can help orchestrate useful metadata across the modern data stack, making all the various tools in the stack more useful — without investing in custom integrations between every tool.
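At the code level, reverse metadata is mostly orchestration: watch for metadata changes, then push the relevant slice into each tool’s native surface. In this sketch, the BI tool’s annotation endpoint and the payload shape are hypothetical; the idea is that the context travels to the user, not the other way around.

```python
# A sketch of reverse metadata: push ownership and definitions from the
# metadata lake into the tool where people already work. The BI tool's
# annotation endpoint and the payload shape are hypothetical.
import requests

def push_context_to_bi_tool(asset: dict) -> None:
    """Send an asset's context (owner, definition, freshness) to the
    BI tool so it renders next to the dashboard itself."""
    requests.post(
        "https://bi.example.com/api/annotations",  # hypothetical endpoint
        json={
            "dashboard_id": asset["dashboard_id"],
            "owner": asset["owner"],
            "definition": asset["definition"],
            "last_refreshed": asset["last_refreshed"],
        },
        timeout=10,
    )

push_context_to_bi_tool({
    "dashboard_id": "revenue_dashboard",
    "owner": "finance-team",
    "definition": "Monthly recognized revenue, per the company metric store.",
    "last_refreshed": "2021-09-06T08:00:00Z",
})
```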
Summing up
In my opinion, the most prophetic sentence in Gartner’s report was, “The stand-alone metadata management platform will be refocused from augmented data catalogs to a metadata ‘anywhere’ orchestration platform.”
We’re just getting started with active metadata, as we work together to figure out the role it could play in today’s and tomorrow’s data ecosystems. I hope this article shed some light on what that future could look like, moving it from the abstract to something much more real.
This article was originally published on Towards Data Science.