Understand how you can meet the challenges of ensuring data quality and accuracy with a modern data dictionary.
Raise your hand if this has happened to you too. ✋
Or this one.
Often, the humans of data (aka folks like you) spend an insane amount of time figuring out what data means and whether or not it’s credible.
80% of a data scientist’s valuable time is spent simply finding, cleaning, and organizing data, leaving only 20% to actually perform analysis. (Harvard Business Review)
Studies show that knowledge workers waste up to 50% of time hunting for data, identifying and correcting errors, and seeking confirmatory sources for data they do not trust. (Harvard Business Review)
Here’s what some humans of data said on Reddit when discussing nightmarish situations at work:
What if you could create a central repository for all your data, one that lets you verify your data’s credibility with quick checks (for example, queries for the minimum, maximum and frequency of values)?
Tip: If you’re looking for a quick refresher on sanity checks for cleaning and organizing data, check out our ultimate guide to data cleaning here.
Even better—what if this repository already ran the standard data check queries for you? No more dependencies on IT to give you a data set health report! Think of all the time and energy you would be saving!
Well, your prayers have been answered—enter the modern data dictionary. 🚀
I know, sounds too good to be true. It’s not, though. Let’s see how a modern data dictionary can help you resolve your trust issues with data.
But hey, first things first… let’s understand the basics of a data dictionary.
What is a modern data dictionary?
The short of it: it’s not just a database dictionary. Nor is it a business glossary.
And now, for the long of it, buckle up!
Here’s a question for you: what do you do when you come across a word you don’t know? You look it up in a dictionary. A modern data dictionary is just like that. It’s the go-to tool for the humans of data (i.e. you) to understand everything about their data sets and verify data credibility at a glance.
A good example of a data dictionary would be Atlan’s auto-generated data dictionary, which provides you with information such as variable name, description, type
Keep in mind that the modern data dictionary goes beyond traditional database dictionaries (also known as metadata repositories) that just store all the metadata. After all, data isn’t just rows x columns. It has a lot of attributes that make up its complete profile. And a modern data dictionary plays a big part in building that profile.
Tip: Wondering what else plays a role in building data profiles? The answer: data catalogs. Read this explainer blog on data catalogs to find out how.
Ummm… how is this any different from a business glossary? Or is it data glossary? 🤔
Hold your horses! We can see how all of this might get confusing. Fret not, here’s how a business glossary (also known as a data glossary) is different from a data dictionary.
Traditionally, data dictionaries referred to database dictionaries, which covered variable names, types, descriptions, frequencies and other such information on data sets.
Within that environment, a data dictionary wasn’t enough as it only made sense to engineering, operations or IT, not to business.
Enter the business glossary (or enterprise business glossary)—defining business terms used within an organization. For instance, a variable like date and its specifications would be an example of an entry in the data dictionary.
Whereas a term like
What’s the key difference between a data dictionary and a business glossary, other than the former being owned by IT and the latter by business?
Well, the same variable might occur in different data sets, with different meanings. A variable like name could stand for the first name in one data set while the last name in another data set.
However, the terms covered in a business glossary have the same definition (and interpretation) across the organization. So, how marketing defines Customers would be the same as how Sales or Support define Customers.
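To make that difference concrete, here is a minimal sketch in Python. The data set names, variable names and definitions are made up for illustration. The point is only the shape of the lookup: a data dictionary entry is found by data set plus variable, while a glossary entry is found by the term alone.

```python
# Hypothetical data dictionary: entries are scoped to a data set, so the
# same variable name can carry different meanings in different data sets.
data_dictionary = {
    ("crm_contacts", "name"): "First name of the contact",
    ("billing_accounts", "name"): "Last name of the account holder",
}

# Hypothetical business glossary: one definition per term, organization-wide.
business_glossary = {
    "Customer": "A person or company with at least one paid order",
}


def describe_variable(dataset: str, variable: str) -> str:
    """Look up a variable's meaning within one specific data set."""
    return data_dictionary[(dataset, variable)]


def define_term(term: str) -> str:
    """Look up a business term; the answer is the same for every team."""
    return business_glossary[term]
```

Asking for `name` without saying which data set you mean has no answer here, which is exactly why the two tools are kept apart.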
Hope that clears up all the confusion. I know, that was quite the information overload. 🤯
Let’s take a minute and recap. So far we’ve understood what a modern data dictionary is, how it’s different from the traditional data dictionary (aka database dictionary aka metadata repository), and what business glossary (or enterprise business glossary) and data glossary really mean.
Psst… Need a refresher on what exactly metadata is? Here you go!
Metadata provides information about the structures that store data. If your data is an image, its metadata includes (but isn’t limited to) file name, format, size, dimensions, date, time and location.
If your data is in a relational database, its metadata could include the variable name, data type, length/size and description.
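If you want to see relational metadata for yourself, here is a small sketch using SQLite through Python’s built-in `sqlite3` module. The `orders` table and the `column_metadata` helper are invented for this example. SQLite reports column names, declared types and nullability, but not descriptions, so those would still need a human.

```python
import sqlite3


def column_metadata(conn: sqlite3.Connection, table: str) -> list[dict]:
    """Return name, declared type and nullability for each column of a table."""
    # PRAGMA table_info yields one row per column:
    # (cid, name, type, notnull, dflt_value, pk)
    # The table name is interpolated directly, so pass only trusted names.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return [
        {"name": name, "type": col_type, "nullable": not notnull}
        for _cid, name, col_type, notnull, _default, _pk in rows
    ]


# Hypothetical table, created in memory just for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER NOT NULL, order_date TEXT, total_sales REAL)"
)
metadata = column_metadata(conn, "orders")
```

Other databases expose the same kind of information in their own way, so treat this as one illustration rather than a general recipe.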
The traditional data dictionary (or metadata repository) stores this metadata.
Bonus Tip: Now wasn’t that handy? 🙌 Oh hey, if you want to read up on metadata silos, we’ve got you covered! Here’s an article on data catalogs featuring the challenges that metadata silos present. And if you’re looking for a solution to manage your metadata easily, see how Atlan can help.
Wondering what an ideal data dictionary would look like?
Well, in a world where working with data is a delightful experience, the ideal data dictionary would have information such as:
- Variable names, types, descriptions/definitions and frequency (how often do these variables appear) within data sets
- Owners and editors of data sets that contain these variables
- Discussions around each variable (stored as tags or notes)
- A first-level check (preliminary stats such as minimum, maximum, frequency and more) or calculations with diagrams or charts that help you determine data quality at a glance
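The first-level check in that last point is simple enough to sketch by hand. The snippet below is a rough illustration in plain Python, not how any particular tool does it. The `profile_column` function and the sample `ages` values are assumptions made for the example.

```python
from collections import Counter


def profile_column(values: list) -> dict:
    """Compute first-level checks for one column, treating None as missing."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "distinct": len(counts),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "most_frequent": counts.most_common(1)[0][0] if counts else None,
    }


# Hypothetical column with one missing value and one repeated value.
ages = [34, 29, None, 41, 29]
age_profile = profile_column(ages)
```

A handful of numbers like these, shown next to each variable, is often enough to tell you whether a column deserves a closer look.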
And in this world, everyone within your organization speaks the same language—the analysts and scientists understand the engineers, the business teams get along well with the IT team… isn’t that the dream?
And if that wasn’t enough, here’s the cherry on top of the cake—with a data dictionary in place, you’ll never have to deal with loss of context or human tribal knowledge since it stores information such as discussions, tags, notes, edit history and more at a single location.
Summing it all up, an ideal data dictionary will help you answer the question “Is my data credible?” instantly. Now if you’re still wondering why data dictionaries matter and how they make your life easy, read on…
What is the importance of a data dictionary?
The estimated global annual spend on data initiatives by companies in 2018 was $114 billion. (The State of Data 2018 – A WinterberryGroup Report)
And despite significant investments in data lakes, most organizations don’t have an easy way for humans to discover, access and share data.
What does this have to do with a data dictionary? Collecting vast amounts of data is useless if you can’t interpret or analyze it. In large organizations, data teams deal with enormous data sets. Usually, the database administrator or engineer handles transforming and storing this data in warehouses or databases for further analysis.
Now imagine if this person were to suddenly disappear tomorrow. Is there documentation somewhere that will explain everything that you need to know to take over the reins?
If you had a data dictionary in place, this problem wouldn’t be a problem in the first place.
And that’s just one challenge. Remember the data accuracy debacle (a harried Mike writing to IT) or the variable name conundrum from before (poor Dalia hunting down the meaning of column_xy881)?
A data dictionary can help you avoid such situations altogether.
But these aren’t the only reasons why you should care about a data dictionary. Here are some of the biggest benefits that a data dictionary has to offer:
- Detect data anomalies quickly: Identifying anomalies in data or missing data is easier since the dictionary displays the results of data checks such as minimum and maximum values or count of distinct values. Spot duplicate, inaccurate or questionable data at a glance.
- Evaluate data quality: When there’s a standard set of variable names and descriptions across an organization, it automatically helps you understand the quality of your data and makes data analysis quicker and easier. Quickly evaluate data quality and speed up your analysis!
- Get more trustworthy data: Since all the information about your data (sources, owners, descriptions, discussions) is recorded in one place, your data becomes more reliable. Now you can truly say “In data, we trust!”
- Build transparency within data teams: When the entire organization understands what every detail within a data set means, it gets everyone on the same page, reduces dependencies, brings consistency in the way you use data and makes onboarding a breeze. From Dalia the Data Scientist to Sam the Business Manager, the lives of the humans of data just got way easier!
Well, now that you know how handy a data dictionary can be, let’s see how to create one.
How can you create a data dictionary?
To create a data dictionary, you should have answers to the following:
- What does each variable/element/field/attribute within a data set mean? What is it describing?
- How did you collect each variable? How did you measure it?
- If there are numeric values, are these values raw or are they calculated using a formula?
- What are the tests or checks you need to run to determine whether your data is trustworthy?
- Who collected your data? Are they still the owners or is it somebody else? Who has interacted with your data and what are the changes that they made? Who oversees the changes made to your data?
- How can you reach out to the owners, admins and editors of your data?
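One way to keep track of these answers is a simple record per variable that shows which questions are still open. The sketch below is only an illustration. The `DictionaryEntry` field names, the `unanswered` helper and the `total_sales` example are all assumptions, not a standard schema.

```python
from dataclasses import dataclass, field, fields


@dataclass
class DictionaryEntry:
    """One variable's record; field names are illustrative, not a standard."""
    name: str
    meaning: str = ""             # what the variable describes
    collection_method: str = ""   # how it was collected or measured
    derivation: str = ""          # "raw", or the formula used to calculate it
    quality_checks: list[str] = field(default_factory=list)
    owner: str = ""               # who is responsible for the data today
    contact: str = ""             # how to reach the owner


def unanswered(entry: DictionaryEntry) -> list[str]:
    """Return the names of fields that are still empty."""
    return [f.name for f in fields(entry) if not getattr(entry, f.name)]


# Hypothetical entry with some questions answered and some still open.
entry = DictionaryEntry(
    name="total_sales",
    meaning="Order value before tax",
    derivation="quantity * unit_price",
    quality_checks=["min >= 0"],
)
gaps = unanswered(entry)
```

Here `gaps` lists the collection method, owner and contact as missing, which gives you a to-do list for that variable.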
Tip: Data dictionary best practices
Now you might notice that it’s harder to find these answers once your data’s already modeled, prepped and being actively used for analysis.
That’s why it’s always a best practice to start building a data dictionary right when you’re modeling your data—makes it a lot easier to define what each variable stands for, how it is being measured/calculated, who can make changes and who is responsible for monitoring the changes made.
BTW… As far as the credibility problem is concerned, it’s always best to tackle the problem at its source. Fixing bad data after it’s already in use is never a good solution. Instead, you should make sure that only accurate data enters your systems. For more on this, check out our blog on data management.
So what comes next? You guessed it—discovering tools that will help you create a data dictionary. Meet Atlan—the human-first DataOps platform that solves your data dictionary challenges automagically.
Bonus content alert! Check out this article to find out how DataOps brings together your data, team, tools and processes in one place and helps you become data-driven.
How can you create a data dictionary with Atlan?
With Atlan’s modern data catalog, your data dictionary is auto-generated. All you need to do is click a button… et voilà!
4 reasons why you should take Atlan’s modern data dictionary for a spin
Atlan’s modern data dictionary lets you:
- Capture human context, monitor data quality and compare and query different revisions.
- Become more independent, since it runs preliminary data checks for you.
- Add context to your data with tags and descriptions.
- Identify data anomalies by observing the basic stats it provides for each column. For example, if the minimum value of total sales is negative, then you definitely know there is an error.
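The negative total sales example boils down to comparing a column’s minimum against the lowest value it could sensibly take. Here is a minimal sketch of that idea in plain Python. It is not Atlan’s implementation or API, and the `find_anomalies` function, the stats and the rules are hypothetical.

```python
def find_anomalies(column_stats: dict, rules: dict) -> list[str]:
    """Flag columns whose recorded minimum is below the lowest allowed value."""
    problems = []
    for column, lowest_allowed in rules.items():
        stats = column_stats.get(column)
        if stats is None:
            continue  # no stats recorded for this column
        if stats["min"] < lowest_allowed:
            problems.append(
                f"{column}: min {stats['min']} is below {lowest_allowed}"
            )
    return problems


# Hypothetical basic stats, as a data dictionary might show them per column.
column_stats = {
    "total_sales": {"min": -120.0, "max": 9800.0},
    "quantity": {"min": 1, "max": 40},
}
# Hypothetical rules: neither column should ever go below zero.
rules = {"total_sales": 0, "quantity": 0}
problems = find_anomalies(column_stats, rules)
```

Only `total_sales` is flagged here, because its minimum is negative while `quantity` stays within bounds.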
Stop going down the “what does this variable mean?” or “is our data credible?” rabbit holes. Instead, use all that time and energy doing what you do best. In data you can FINALLY trust. Thanks to Atlan’s data dictionary: a complete health check report on all your data.
Learn more about managing data quality with Atlan by reaching out to us and setting up a demo here.