Here’s the definition of a data catalog and why a data catalog could be just the thing you need to meet the challenges of data and metadata management and collaboration.
Let’s understand the definition, use cases and value of a modern data catalog with a small story: two data scientists walk into a library at the end of a long day…
Data scientist #1 to the librarian: “Can I get a copy of this book on statistical methods?” Goes on to share the name of the obscure book.
Data scientist #2 to Data scientist #1: “They’ll never be able to find that book.”
The librarian clacks away on the keyboard for a couple of seconds before replying:
“Found it! Here are the details of its author, publishing house and borrowing history. Oh, and someone left a comment saying they found it super useful for understanding logistic regressions. I can grab it for you in a jiffy.” 🤓
Data scientist #1 to Data scientist #2: “Ummmm… why can’t the same thing happen with our data?” 🤔
But, what if it could? Enter data catalogs—the missing layer in your data lake. Now get the data you need with the context you need! 💡
First… what is a data catalog? How would you define it?
As seen in the chat above, the simplest definition of a data catalog is that it is a library or inventory of all your data assets across your data sources—a place where all your data is neatly indexed, organized and kept ready for use.
(If Monica from Friends made a data catalog, this would be it—neat to the T!)
Here’s the definition of a data catalog according to leading research firm Gartner:
A data catalog creates and maintains an inventory of data assets through the discovery, description and organization of distributed datasets. The data catalog provides context to enable data stewards, data/business analysts, data engineers, data scientists and other line of business (LOB) data consumers to find and understand relevant datasets for the purpose of extracting business value.
But more importantly, Gartner goes on to say:
Modern machine-learning-augmented data catalogs automate various tedious tasks involved in data cataloging, including metadata discovery, ingestion, translation, enrichment and the creation of semantic relationships between metadata. These next-generation data catalogs can therefore propel enterprise metadata management projects by allowing business users to participate in understanding, enriching and using metadata to inform and further their data and analytics initiatives.
Thus, modern data catalogs can help you manage your metadata (aka metadata management) so that you can easily curate and access important business context around your data, along with the data itself. And you can do that across all your data sources, from the cloud to your BI tools.
PSA: A data catalogue is exactly the same thing as a data catalog, just spelled the British English way!
Sounds like a dream? Well, it’s possible!
A truly powerful data catalog can help you:
- Create a repository of all your data from various data sources, including notes on a data set’s structure, quality, definitions and usage
- Allow users to access the metadata alongside the data itself
- View and understand the lineage of the data—including the data source, the transformations applied, and who has been using it
- Ensure data consistency and accuracy by updating itself automagically, while allowing humans to edit and remain in the loop
- Simplify data governance and compliance by providing a graphical representation of the lineage of the data assets—tracing it across its lifecycle
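To make the list above a little more concrete, here is a minimal Python sketch of a catalog that stores metadata alongside each asset, lets you search it, and traces lineage. Everything in it (the `CatalogEntry` and `DataCatalog` classes, the field names and the sample assets) is a hypothetical illustration, not the API of any real data catalog product.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CatalogEntry:
    """Metadata describing one data asset (all field names are illustrative)."""
    name: str
    source: str                     # where the asset lives, e.g. a warehouse or BI tool
    description: str = ""
    tags: List[str] = field(default_factory=list)
    upstream: List[str] = field(default_factory=list)  # assets this one is derived from


class DataCatalog:
    """A toy in-memory catalog: register assets, search them, trace lineage."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def search(self, keyword: str) -> List[str]:
        """Return names of assets whose name, description or tags mention the keyword."""
        kw = keyword.lower()
        return sorted(
            e.name for e in self._entries.values()
            if kw in e.name.lower()
            or kw in e.description.lower()
            or any(kw in t.lower() for t in e.tags)
        )

    def lineage(self, name: str) -> List[str]:
        """Return every upstream asset of `name`, nearest ancestors first."""
        seen: List[str] = []
        queue = list(self._entries[name].upstream)
        while queue:
            current = queue.pop(0)
            if current in seen:
                continue
            seen.append(current)
            if current in self._entries:
                queue.extend(self._entries[current].upstream)
        return seen


# Hypothetical sample assets, purely for illustration.
catalog = DataCatalog()
catalog.register(CatalogEntry("raw_orders", "postgres", "Orders as exported from the shop database"))
catalog.register(CatalogEntry("clean_orders", "warehouse", "Deduplicated orders",
                              tags=["sales"], upstream=["raw_orders"]))
catalog.register(CatalogEntry("revenue_report", "bi_tool", "Monthly revenue by region",
                              tags=["sales", "finance"], upstream=["clean_orders"]))

print(catalog.search("sales"))            # ['clean_orders', 'revenue_report']
print(catalog.lineage("revenue_report"))  # ['clean_orders', 'raw_orders']
```

A real catalog would discover and refresh these entries automatically rather than rely on manual `register` calls, but the shape of the idea is the same: the data's context travels with the data.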
This brings us to a related question…
What is the meaning of a data catalog?
Similar to the definition given above, “data catalog meaning” refers to a living catalog of your data assets, along with their context (aka metadata), in one place.
But wait, still not convinced on…
Why you even need a data catalog?
Here’s the short of it. If you need to use data, understand your data and where it came from, and share this data with your team securely—you need a data catalog.
Too oversimplified? Well, here’s the long of it.
If you’re reading this article, you know that companies today are dealing with vast amounts of data.
The Global Datasphere will grow from 33 Zettabytes (ZB) in 2018 to 175 ZB by 2025. (Source: Data Age 2025 report by IDC. The datasphere is defined as the sum of all data created, captured or replicated across core sources, edge sources and data endpoints.)
If it feels like we’re all drinking from a data firehose, it’s because we are. (Source: Mary Meeker, Internet Trends 2019 report)
And companies that can harness the enormous signal power of this data are expected to win.
- According to Booz Allen Hamilton’s Data Science Playbook, businesses that deploy analytics across most of the organization, align daily operations with senior management’s goals, and incorporate big data will see a 1,000 percent increase in ROI.
- As expected, companies are investing heavily in big data to gain a competitive edge—they were expected to invest $114 billion in big data in 2018, up from $31 billion in 2013.
But there are many challenges in this process of becoming data-driven.
One of the primary challenges is enabling teams to discover, understand, govern and consume the data they need to make better decisions.
The two biggest challenges in data management are centered around data catalogs: finding and identifying data that delivers value, and supporting data governance and data security. (Source: Gartner Data Management Strategy Survey 2017)
And the stakes are high.
By 2022, over 60% of traditional IT-led data catalog projects that do not use ML to assist in finding and inventorying data distributed across a hybrid/multicloud ecosystem will fail to be delivered on time, leading to derailed data management, analytics and data science projects. (Source: Gartner research)
But don’t take our or Gartner’s word for it. The pain of siloed and missing data is real. Here’s what we saw on Reddit.
But the proof of the pudding lies in the eating, so we ask you—yes, the business analyst trying to uncover the mystery of column latest_Kirk_02122019_keep or the IT admin who’s tired of asking for email permissions to access data:
- How much time do you spend just looking for the data you need?
- How much do you even know about your data?
- Do you know the source of your data?
- Do you know the quality of your data?
- Can you rate your data assets?
- Can you get and give data access easily and securely?
If your answer to any of the above is a big resounding “UMMMMM”, the writing’s on the wall. It’s time to get a data catalog.
The need of the hour is to remove data silos, let analytics flow at the speed of thought and create a single source of truth for your entire team.
Oh, already out of the door to shop for the latest shiny data catalog tool? Not so fast, mister, because…
Beware, there are metadata silos everywhere
Simply plugging in an isolated data catalog tool within your data lake may not be the answer to your data woes. Today’s business mandates that data be available for whoever needs it, wherever and whenever they need it (read more on DataOps here).
That’s why it’s essential for a data catalog tool to:
- Let its data stay updated automagically by crowdsourcing updates and knowledge (such as versions, lineage, user ratings); and
- Allow updated data to be plugged in across your data applications/analytics tools and platforms, thus creating one source of truth for your data.
So that everyone stays on the same data page! (And knows how to switch between pages or even other books!)
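As a rough illustration of what "crowdsourcing updates and knowledge" could look like under the hood, here is a small Python sketch in which users edit an asset's description (with earlier versions kept) and rate the asset. The `AssetMetadata` class, its methods and the user names are all assumptions made up for this example. They are not taken from any specific tool.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Dict, List, Optional


@dataclass
class AssetMetadata:
    """Crowdsourced metadata for one asset (hypothetical structure)."""
    name: str
    description: str = ""
    history: List[Dict[str, str]] = field(default_factory=list)  # earlier descriptions
    ratings: Dict[str, int] = field(default_factory=dict)        # one rating per user

    def edit_description(self, user: str, text: str) -> None:
        """Replace the description, keeping the previous version in history."""
        self.history.append({"replaced_by": user, "description": self.description})
        self.description = text

    def rate(self, user: str, stars: int) -> None:
        """Record a 1-5 star rating; a repeat rating by the same user overwrites the old one."""
        if not 1 <= stars <= 5:
            raise ValueError("stars must be between 1 and 5")
        self.ratings[user] = stars

    def average_rating(self) -> Optional[float]:
        """Return the mean rating, or None if nobody has rated the asset yet."""
        return mean(self.ratings.values()) if self.ratings else None


# Hypothetical users and asset, purely for illustration.
meta = AssetMetadata("clean_orders", description="Orders table")
meta.edit_description("asha", "Deduplicated orders, refreshed nightly")
meta.rate("asha", 5)
meta.rate("ben", 3)
meta.rate("ben", 4)  # ben changes his mind; only his latest rating counts

print(meta.description)       # Deduplicated orders, refreshed nightly
print(meta.history)           # [{'replaced_by': 'asha', 'description': 'Orders table'}]
print(meta.average_rating())  # 4.5
```

Keeping the history and the per-user ratings in the same record is one way to let humans stay in the loop while every connected tool reads the same, current version.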
By 2024, machine-learning-augmented data preparation, data catalogs, data unification and data quality tools will converge into a consolidated modern enterprise information management platform used for the majority of new analytics projects. (Source: Gartner 2019 Market Guide for Data Preparation)
Stay ahead of the curve. Don’t get yet another data catalog tool that will create siloed metadata catalogs. Adopt a data catalog tool that lets you bring your data, human tribal knowledge and business context together in one place, and that earns you brownie points from your compliance team!
Now let’s look at some examples of a modern data catalog platform or software.
Examples of a modern data catalog tool
As a quick Google search will reveal, the data management/cataloging software market is rife with data catalog platforms. Most of these data catalog tools profess to provide the same, oft-lauded benefits: a catalog of your data, metadata in one place, and mechanisms to govern your data and make it usable.
But the problem with many of these examples or categories of data catalog tools is that they fail to deliver on the promise of data democratization. In simple words, while they bring your data and metadata into one place, the overall data experience is far from optimal, and so these tools are very likely (and ironically) doomed to become siloed tools themselves!
So what is the answer, you ask?
Going beyond traditional data catalogs with Atlan
As seen above, gone are the days when you could create one single catalog for your company via the IT Team and then direct everyone to use it. Today, the sources, users and use cases of data have multiplied and become dynamic. And data catalogs need to keep up with the times. That’s why yet another data catalog tool won’t make the cut.
We’ve put all these principles into action with Atlan—the home for data teams.
Introducing the first data catalog built for the future
We believe that Atlan can help you create a ‘living’ catalog that grows as your data and team grow. That’s why Atlan is the modern data management solution for the workplace of the future. With Atlan, you can:
Create a living catalog of all your data assets and knowledge: You can discover and access data with its context via an intuitive, Marketplace-like interface; create a single source of truth for your data across its applications; bring human tribal knowledge and business context alongside your data; and understand and improve data quality at every step of the way—automagically.
Integrate with all the tools you already love and use: Atlan helps your data live where you want it to and connect with the tools you choose. You can make it easy for anyone to use the data they need, whenever they need it, and in whichever format they desire.
Enjoy a Data UX designed for teamwork and collaboration: Atlan helps you stop data chaos with our reimagined Data Support Desk; get a bird’s eye view of your team’s activities with a data news feed and notifications; track data usage and adoption across your ecosystem, wherever your data goes; and stay on the same page with inbuilt version control and history.
As always, keep your humans first. Consider their needs and challenges.
Many companies have invested heavily in technology as a first step toward becoming data-oriented, but this alone clearly isn’t enough. Firms must become much more serious and creative about addressing the human side of data if they truly expect to derive meaningful business benefits. (Source: Randy Bean and Thomas H. Davenport, HBR)
Quick recap, before you go
A modern data catalog will help you:
- Create a single source of truth for your data across all its applications
- Make data cataloging a part of your data processes, not an isolated activity
- Quickly access and share the insights you need via a centralized repository
- Enforce and simplify data security and compliance (GDPR, CCPA, etc.)
And that’s it! Time to go forth and jumpstart your data management strategy—create one source of truth for your data.
Try Atlan—a human-first way to manage, share and curate your data. Book your personalized demo now.