Snapcommerce is building the next generation of mobile shopping across three verticals: travel, fintech and goods. As we’ve quickly scaled from one to three verticals, our business stakeholders have remained active users of our data platform and assets. We are a tech-savvy organization, and most Snapcommerce employees autonomously write SQL and build dashboards/reports to answer their day-to-day questions. We recognized a need for source-of-truth documentation in a user-friendly format that would support our ongoing reliance on, and love of, self-serve tools. A data catalog serves that need nicely.
What is a Data Catalog?
A data catalog is a tool that consolidates and organizes your collection of data assets. Data assets take many forms: data tables, columns, metric definitions, and column lineage from model to model. An effective data catalog serves as a one-stop shop where business and data stakeholders can answer the vast majority of documentation-related questions that arise.
Why We Care
Snapcommerce was looking for a way to standardize and share our data definitions across the organization. We also wanted a solution that eliminated the need for coding by business stakeholders, and that provided quick navigational capabilities. We went through a selection process to find the best data catalog for our use case. In doing so, we collected feedback from business stakeholders who expressed their desired end-state for a data catalog, and then began to evaluate tools based on those requirements. Here’s a non-exhaustive summary of our criteria:
- An easy-to-navigate interface, intuitive enough for newly onboarded employees
- A strong search capability with the ability to filter on all assets across various sources (dbt, Looker, Snowflake)
- An automated crawler that pulls information into the data catalog on a schedule
- A clear, consolidated and concise definitions/glossary section
- Permission handling
- A table preview and SQL component
- Data lineage visualizations (showing the downstream and upstream flow of data)
Atlan was our favoured tool. Most of the tools we evaluated met our basic requirements, though given the novelty of data cataloging, we heard a lot of “roadmap discussions” about forward-looking features that we could expect in the future…but not yet. Our final decision prioritized the less commonly available, yet highly useful, features of a data catalog so that we could benefit from day one. These features were data lineage, user permission settings, and a glossary. Data lineage from initial ingestion to final report is exceptionally helpful when updating code, fixing bugs, onboarding, and deleting unused assets. We love it! User permissions enable us to restrict or grant access depending on an asset’s sensitivity level. An obvious win. And finally, the glossary enables us to host stakeholder-verified metric definitions in one place. It’s a Data Governance Manager’s dream.
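To make the lineage point concrete: catalog crawlers typically derive model-to-model lineage from the transformation code itself. In dbt, for example, lineage edges come from `ref()` calls. The model and column names below are hypothetical, shown only as a sketch of how a lineage tool sees dependencies:

```sql
-- models/fct_bookings.sql (hypothetical model and column names)
-- A lineage crawler parses the ref() calls below to draw the edges
-- stg_bookings -> fct_bookings and stg_customers -> fct_bookings.
select
    b.booking_id,
    b.booking_date,
    c.customer_segment
from {{ ref('stg_bookings') }} as b
left join {{ ref('stg_customers') }} as c
    on b.customer_id = c.customer_id
```

Because lineage is inferred from code rather than maintained by hand, it stays accurate as models change.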
It’s a Trade-Off
While the benefits of data cataloging are clear, it raises the question: why don’t more companies choose to catalog? It’s all about implementation. The cost of implementation is not one to underestimate. It takes significant time and effort to prepare a data catalog for regular use. At a bare minimum, this preparation includes building data definitions and glossaries for all common tables and metrics in your database.
In our scenario, it was the Data Analysts and Engineers who populated this information, and our business stakeholders who reviewed it. For our documentation process, we chose to write our data definitions in internally administered tools such as dbt and Looker, and then run a crawler to pull that metadata into the catalog. This way, we avoided mismatched documentation across tools. Since our team already maintained thorough documentation in dbt, we had a huge head start. By distributing the remaining documentation responsibilities across the team, each contributor spent only a few hours populating the previously undocumented definitions. Though setup was laborious, we were prepared.
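As a sketch of what that dbt-side documentation looks like (the model, column, and test names here are illustrative, not our actual schema), column definitions live in a `schema.yml` file that a catalog crawler can ingest:

```yaml
# models/schema.yml (illustrative names)
version: 2

models:
  - name: fct_bookings
    description: "One row per confirmed travel booking."
    columns:
      - name: booking_id
        description: "Surrogate key; unique identifier for a booking."
        tests:
          - unique
          - not_null
      - name: gross_booking_value
        description: "Total amount paid by the customer, in USD."
```

Because the crawler pulls these descriptions on a schedule, the definitions in dbt and in the catalog stay in sync without any duplicate editing.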
Our team decided to start cataloging early, and it has paid off! As the company scales, so do its data assets. With proper data documentation in place now, we need only worry about maintenance moving forward. And luckily for us, maintenance is easy since it happens at the data modeling stage and flows through to the catalog automatically. Creating the data catalog cost us time that would otherwise have been spent furthering our analytics projects. We were nevertheless willing to make this trade-off because we recognized that implementing a data catalog further down the line would take even more time. Why not start off on the right foot, and reap the added benefits earlier on?
Learnings to Pass On
Here are three learnings that we’d like to pass on about data cataloging.
- This tool was more useful to the data team than expected. Many internal questions from business stakeholders can now be answered by sharing a link. The tool has enabled self-serve solutioning, as we’d hoped. While business users mostly leverage the glossary, our data team benefits from knowledge sharing across business domains and lines of business: shared metrics are tagged, and tables are easily queried by leveraging the lineage and column definitions provided in the tool. Essentially, you no longer need to have built the data model, or to speak with its owner, in order to understand and query a table in our database.
- Having all documentation about our database in one location makes finding terminology easy-breezy.
- This is not plug-and-play. Substantial effort is required to set up a comprehensive data catalog, and it takes initial dedication to point business stakeholders towards the tool so that it becomes a habitual part of their routine when answering data-related questions.
For more articles about technology, visit the Snapcommerce Medium homepage.
This article was originally published by Snapcommerce on Medium.