Here’s why data catalogs could be just the thing you need to meet the challenges of data and metadata management and collaboration
Two data scientists walk into a library at the end of a long day…
Data scientist #1 to the librarian: “Can I get a copy of this book on statistical methods?” Goes on to share the name of the obscure book.
Data scientist #2 to Data scientist #1: “They’ll never be able to find that book.”
The librarian clacks away on the keyboard for a couple of seconds before replying:
“Found it! Here are the details of its author, publishing house and borrowing history. Oh, and someone left a comment saying they found it super useful for understanding logistic regressions. I can grab it for you in a jiffy.” 🤓
Data scientist #1 to Data scientist #2: “Ummmm… why can’t the same thing happen with our data?” 🤔
But, what if it could? Enter data catalogs—the missing link in your data lake. Now get the data you need with the context you need! 💡
First… what is a data catalog?
As seen in the chat above, a data catalog is a library or inventory of all your data assets—a place where all your data is neatly indexed, organized and kept ready for use.
(If Monica from Friends made a data catalog, this would be it: neat to a T!)
According to leading research firm Gartner:
A data catalog creates and maintains an inventory of data assets through the discovery, description and organization of distributed datasets. The data catalog provides context to enable data stewards, data/business analysts, data engineers, data scientists and other line of business (LOB) data consumers to find and understand relevant datasets for the purpose of extracting business value.
But more importantly, Gartner goes on to say:
Modern machine-learning-augmented data catalogs automate various tedious tasks involved in data cataloging, including metadata discovery, ingestion, translation, enrichment and the creation of semantic relationships between metadata. These next-generation data catalogs can therefore propel enterprise metadata management projects by allowing business users to participate in understanding, enriching and using metadata to inform and further their data and analytics initiatives.
Thus, modern data catalogs can help you manage your metadata (aka metadata management) in a way that you can easily curate and access important business context around your data—along with your data itself.
Sounds like a dream? Well, it’s possible!
A truly powerful data catalog can help you:
- Create a repository of all your data, including notes on a data set’s structure, quality, definitions and usage
- Allow users to access the metadata alongside the data itself
- View and understand the lineage of the data—including the source, the transformations applied, and who has been using it
- Ensure data consistency and accuracy by updating itself automagically, while allowing humans to edit and remain in the loop
- Simplify data governance and compliance by providing a graphical representation of the lineage of the data assets—tracing it across its lifecycle
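To make that list a little more concrete, here is a minimal Python sketch of what a catalog keeps for each data asset and how lineage can be traced from it. Everything in it is an assumption made for illustration: the `CatalogEntry` and `Catalog` classes, their fields and the sample tables are hypothetical and do not come from any particular data catalog product.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One data asset plus the context a catalog keeps about it (hypothetical shape)."""
    name: str
    description: str
    owner: str
    columns: dict[str, str]                             # column name -> definition
    upstream: list[str] = field(default_factory=list)   # assets this one is derived from
    notes: list[str] = field(default_factory=list)      # context added by humans


class Catalog:
    """A toy in-memory inventory of data assets."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, keyword: str) -> list[str]:
        """Return names of assets whose name, description or columns mention the keyword."""
        kw = keyword.lower()
        hits = []
        for e in self._entries.values():
            text = " ".join([e.name, e.description, *e.columns, *e.columns.values()])
            if kw in text.lower():
                hits.append(e.name)
        return sorted(hits)

    def lineage(self, name: str) -> list[str]:
        """Follow upstream links and return every ancestor of an asset, nearest first."""
        seen: list[str] = []
        queue = list(self._entries[name].upstream)
        while queue:
            parent = queue.pop(0)
            if parent in seen:
                continue
            seen.append(parent)
            if parent in self._entries:
                queue.extend(self._entries[parent].upstream)
        return seen


# Sample assets, invented for this example.
catalog = Catalog()
catalog.register(CatalogEntry(
    "raw_orders", "Orders exported nightly from the shop database", "data-eng",
    {"order_id": "Unique order key"}))
catalog.register(CatalogEntry(
    "clean_orders", "Deduplicated orders", "data-eng",
    {"order_id": "Unique order key", "amount_usd": "Order value in US dollars"},
    upstream=["raw_orders"]))
catalog.register(CatalogEntry(
    "monthly_revenue", "Revenue rolled up by month", "analytics",
    {"month": "Calendar month", "revenue_usd": "Sum of amount_usd"},
    upstream=["clean_orders"]))

print(catalog.search("revenue"))           # ['monthly_revenue']
print(catalog.lineage("monthly_revenue"))  # ['clean_orders', 'raw_orders']
```

A real catalog would discover and refresh these entries automatically rather than rely on hand-written `register` calls, but the idea is the same: the metadata travels with the data, and lineage is just a walk over the recorded upstream links.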
But wait, still not convinced on…
Why you even need a data catalog?
Here’s the short of it. If you need to use data, understand your data and where it came from, and share this data with your team securely—you need a data catalog.
Too oversimplified? Well, here’s the long of it.
If you’re reading this article, you know that companies today are dealing with vast amounts of data.
The Global Datasphere will grow from 33 Zettabytes (ZB) in 2018 to 175 ZB by 2025. (Data Age 2025 report by IDC. The datasphere is defined as the sum of all data created, captured or replicated across core sources, edge sources and data endpoints.)
If it feels like we’re all drinking from a data firehose, it’s because we are. (Mary Meeker, Internet Trends 2019 report)
And companies that can harness the enormous signal power of this data are expected to win.
- According to Booz Allen Hamilton’s Data Science Playbook, businesses that deploy analytics across most of the organization, align daily operations with senior management’s goals, and incorporate big data will see a 1,000 percent increase in ROI.
- As expected, companies are investing heavily in big data to gain a competitive edge—they were expected to invest $114 billion in big data in 2018, up from $31 billion in 2013.
But there are many challenges in this process of becoming data-driven.
One of the primary challenges is enabling teams to discover, understand, govern and consume the data they need to make better decisions.
The two biggest challenges in data management are centered around data catalogs: finding and identifying data that delivers value, and supporting data governance and data security. (Gartner Data Management Strategy Survey 2017)
And the stakes are high.
By 2022, over 60% of traditional IT-led data catalog projects that do not use ML to assist in finding and inventorying data distributed across a hybrid/multicloud ecosystem will fail to be delivered on time, leading to derailed data management, analytics and data science projects. (Gartner research)
But don’t take our or Gartner’s word for it. The pain of siloed and missing data is real. Here’s what we saw on Reddit.
Still, the proof of the pudding is in the eating. So we ask you, whether you’re the business analyst trying to uncover the mystery of column latest_Kirk_02122019_keep or the IT admin who’s tired of emailing back and forth for permission to access data:
- How much time do you spend just looking for the data you need?
- How much do you even know about your data?
- Do you know the source of your data?
- Do you know the quality of your data?
- Can you rate your data assets?
- Can you get and give data access easily and securely?
If your answer to any of the above is a big resounding “UMMMMM”, the writing’s on the wall. It’s time to get a data catalog.
The need of the hour is to remove data silos, let analytics flow at the speed of thought and create a single source of truth for your entire team.
Oh, already out of the door to shop for the latest shiny data catalog tool? Not so fast mister, because…
Beware, there are metadata silos everywhere
Simply plugging in an isolated data catalog tool within your data lake may not be the answer to your data woes. Today’s business mandates that data be available for whoever needs it, wherever and whenever they need it (read more on DataOps here).
That’s why it’s essential for a data catalog tool to:
- Let its data stay updated automagically by crowdsourcing updates and knowledge (such as versions, lineage, user ratings); and
- Allow updated data to be plugged in across your data applications, analytics tools and platforms, thus creating one source of truth for your data.
So that everyone stays on the same data page! (And knows how to switch between pages or even other books!)
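As a rough illustration of the two requirements above, the Python sketch below keeps crowdsourced ratings in one shared place and lets any tool read the same summary. The `SharedMetadata` class, its methods and the user names are all hypothetical, made up for this example; they are not the API of any real catalog tool.

```python
from collections import defaultdict
from statistics import mean


class SharedMetadata:
    """One shared store of user ratings that every tool reads from (hypothetical)."""

    def __init__(self):
        self._ratings = defaultdict(dict)  # asset name -> {user: stars}

    def rate(self, asset, user, stars):
        """Record a 1-5 star rating; a user's newer rating replaces the older one."""
        if not 1 <= stars <= 5:
            raise ValueError("stars must be between 1 and 5")
        self._ratings[asset][user] = stars

    def summary(self, asset):
        """Return the same view of an asset no matter which tool is asking."""
        votes = self._ratings.get(asset, {})
        if not votes:
            return {"asset": asset, "ratings": 0, "average": None}
        return {"asset": asset, "ratings": len(votes),
                "average": round(mean(votes.values()), 2)}


store = SharedMetadata()
store.rate("monthly_revenue", "ana", 5)
store.rate("monthly_revenue", "ben", 3)
store.rate("monthly_revenue", "ana", 4)  # ana updates her earlier rating

# A dashboard and a notebook would both call the same method and see the same answer.
print(store.summary("monthly_revenue"))
# {'asset': 'monthly_revenue', 'ratings': 2, 'average': 3.5}
```

Because every tool reads from the one store, an update made in one place shows up everywhere else, which is the opposite of a metadata silo.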
By 2024, machine-learning-augmented data preparation, data catalogs, data unification and data quality tools will converge into a consolidated modern enterprise information management platform used for the majority of new analytics projects. (Gartner 2019 Market Guide for Data Preparation)
Stay ahead of the curve. Don’t get yet another data catalog tool that will create siloed metadata catalogs. Adopt one that lets you bring your data, human tribal knowledge and business context together in one place, and that earns you brownie points from your compliance team!
Going beyond traditional data catalogs with Atlan
As seen above, gone are the days when you could create one single catalog for your company via the IT Team and then direct everyone to use it. Today, the sources, users and use cases of data have multiplied and become dynamic. And data catalogs need to keep up with the times. That’s why yet another data catalog tool won’t make the cut.
We’ve put all these principles into action with Atlan—the home for data teams.
We believe that Atlan can help you create a ‘living’ catalog that grows as your data and team grow. With Atlan, you can:
- Give data its own profile: You can now create a snapshot of your data with Atlan, just like sales leads on Salesforce or code on GitHub. You can easily discover, understand, share and collaborate on data in one place, and share it via a simple URL.
- Bring human tribal knowledge alongside data: Because business context is just as important as the data itself, Atlan helps you add and share your metadata right where your data lives via README summaries, ratings, discussions and notes.
- Integrate with all the tools you already love and use: Atlan helps you create one source of truth across your entire data ecosystem by letting you seamlessly plug data from your catalog into any other downstream or upstream application, from Microsoft Excel to Power BI.
As always, keep your humans first. Consider their needs and challenges.
Many companies have invested heavily in technology as a first step toward becoming data-oriented, but this alone clearly isn’t enough. Firms must become much more serious and creative about addressing the human side of data if they truly expect to derive meaningful business benefits. (Randy Bean and Thomas H. Davenport, HBR)
Quick recap, before you go
A modern data catalog will help you:
- Create a single source of truth for your data across all its applications
- Make data cataloging a part of your data processes, not an isolated activity
- Quickly access and share the insights you need via a centralized repository
- Enforce and simplify data security and compliance (GDPR, CCPA, etc.)
And that’s it! Time to go forth and jumpstart your data management strategy—create one source of truth for your data.
Try Atlan—a human-first way to manage, share and curate your data. Request early access now.