Demystifying the Active Metadata Management Market
The Active Metadata Pioneers series features Atlan customers who have recently completed a thorough evaluation of the Active Metadata Management market. Paying forward what they’ve learned to the next data leader is the true spirit of the Atlan community, so they’re here to share their hard-earned perspective on an evolving market, what makes up their modern data stack, innovative use cases for metadata, and more!
In this edition, we meet Gu Xie, Head of Data Engineering at Group 1001 and two-time user of Atlan, who explains Group 1001’s unique structure and how that affects their data needs, his hard-earned perspective on the active metadata management market, and how he’ll use Atlan to drive productivity and clarity across his organization.
This interview has been edited for brevity and clarity.
Would you mind describing Group 1001 and your data team?
Our organization is the data engineering team. Group 1001 is an insurance holding company, an umbrella for several different brands, including Delaware Life, Gainbridge, Clear Spring Life and Annuities, and several others.
What we’re focused on within our team is the annuity side of the business. So we directly interface with our core policy management system for provisioning and handling all of the annuities business. Our engineering team is responsible for ensuring that we can provide analytics on the annuity side of the business, whether that’s for our operations team, for sales, or for marketing.
Each business is a little bit different. Gainbridge is a direct-to-consumer brand, whereas Delaware Life revolves around a financial advisor-driven business where we’re doing more of a B2B2C model. So two different businesses, different brands, different products, but we’re providing the breadth of analytics across both.
And how about you? Could you tell us a bit about yourself, your background, and what drew you to Data & Analytics?
I’ve been working in data engineering and data & analytics since the very start of my career. I’ve been in this industry for… gosh, I think it’s about 11 plus years now.
Right out of college, I had a really good opportunity to dive into the world of CRM, but ended up doing anything but CRM and focused more on the data itself: building out business intelligence, doing report and data migrations, leading data warehouse teams, and driving the modernization of data & analytics platforms as organizations moved to the cloud. That’s where I’ve built my core competency: really enabling and stitching together this modern data stack for an organization, such that they can get really comprehensive data & analytics capabilities without hiring a massive team.
So I’ve done this before in my prior organization with a team of 40-plus engineers. In that organization, we chose and then implemented a traditional data catalog, but spent a ton of engineering hours integrating it, then had trouble getting it adopted by consumers and stewards. We weren’t very happy with it. Then we migrated to Atlan and had much better luck activating the data stack we had all built together.
Here at Group 1001, we were able to build an entire end-to-end data analytics platform in under 10 months with a team of four. That just goes to show that if you have a really strong mental model of this modern data & analytics stack, and know where your organization will need to fit and piece things together, you don’t need a massive engineering team. You can have a really small team that can build and enable this.
We’re leveraging a lot of CI/CD and automation, and at the same time, are able to get the benefits of the modern data stack, which is incredible end-to-end velocity from idea to insight. That’s the focal point of the vision: Idea-to-insight, and getting velocity there.
What does your stack look like?
We have data sources whereby data resides in databases, file and blob storage, and SaaS applications like Zendesk, Google Analytics, and Salesforce. We also have APIs, whether internal APIs or events and logs.
The way we started with this tech stack, we built around Snowflake as our core data platform. We were on GCP, so we did an extensive POC comparing BigQuery and Snowflake, and ended up choosing Snowflake.
Then we ran into a situation whereby, “Okay, we need to replicate our data into Snowflake,” because in the past we were building ETL pipelines into Postgres, and that just doesn’t scale. So we leveraged Fivetran for both CDC replication and SaaS replication. We can access the data from the database side of the fence, as well as tap into all the different SaaS applications that Fivetran supports. So we can onboard Google Analytics, Zendesk, Google Ads, and Salesforce data onto Snowflake to have that holistic centralization of all of our data and assets.
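For readers curious what driving this replication layer looks like in practice, here’s a minimal sketch that triggers and monitors a Fivetran connector sync over Fivetran’s REST API. The connector ID and credentials are placeholders, and the response fields follow Fivetran’s documented connector object; in day-to-day use, syncs run on Fivetran’s own schedule rather than on demand.

```python
import requests
from requests.auth import HTTPBasicAuth

FIVETRAN_API = "https://api.fivetran.com/v1"
AUTH = HTTPBasicAuth("your-api-key", "your-api-secret")  # placeholder credentials
CONNECTOR_ID = "salesforce_prod"  # hypothetical connector ID

# Trigger an on-demand sync for one connector.
resp = requests.post(
    f"{FIVETRAN_API}/connectors/{CONNECTOR_ID}/sync", auth=AUTH, timeout=30
)
resp.raise_for_status()

# Check the connector's sync state and when it last completed successfully.
details = requests.get(
    f"{FIVETRAN_API}/connectors/{CONNECTOR_ID}", auth=AUTH, timeout=30
).json()["data"]
print(details["status"]["sync_state"], details["succeeded_at"])
```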
Then we also went down the path of, “We need to model and shape this data so it’s readily available for analytics, and unify the data model across our various lines of business.” So we brought in Coalesce, because that gave us the scale, the standardization, and the automation that we need in order to create the data models and shape them for consumption. On top of that, we brought in Dagster as an orchestrator to fully replace Airflow. We set up the infrastructure in about a week, and three days after that we had migrated all 73 DAGs over from Airflow to Dagster. That was just huge.
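To give a flavor of what that migration looks like, here’s a minimal Dagster sketch of one pipeline expressed as software-defined assets with a nightly schedule. The asset names, bodies, and cron cadence are hypothetical stand-ins, not Group 1001’s actual jobs.

```python
from dagster import Definitions, ScheduleDefinition, asset, define_asset_job


@asset
def raw_policies():
    """Raw annuity policy records replicated into the warehouse (e.g., by Fivetran)."""
    ...  # in a real pipeline: verify the replicated table is present and fresh


@asset(deps=[raw_policies])
def policy_analytics():
    """Analytics-ready policy model shaped from the raw data."""
    ...  # in a real pipeline: run the transformation (e.g., a Coalesce job)


# Package both assets into a job and run it nightly, replacing an Airflow DAG.
refresh_job = define_asset_job("refresh_policies", selection="*")

defs = Definitions(
    assets=[raw_policies, policy_analytics],
    jobs=[refresh_job],
    schedules=[ScheduleDefinition(job=refresh_job, cron_schedule="0 5 * * *")],
)
```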
We then also have Soda for building various data quality rules, to ensure we have all the monitoring in place for our quality criteria: integrity, completeness, freshness, those kinds of components. We use Soda to enable our team to build quality rules. And then there’s where Atlan comes into the journey. We see it as part of our data management suite: Soda from a quality monitoring perspective, and Atlan to enable data discovery.
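As an illustration of the kinds of rules Soda supports, here’s a minimal sketch using soda-core’s Python API with SodaCL checks covering freshness, completeness, and integrity. The table, columns, and configuration file are hypothetical examples, not Group 1001’s actual rules.

```python
from soda.scan import Scan  # soda-core

# SodaCL checks for a hypothetical POLICIES table.
checks = """
checks for POLICIES:
  - freshness(updated_at) < 1d      # freshness: refreshed within the last day
  - missing_count(policy_id) = 0    # completeness: no missing keys
  - duplicate_count(policy_id) = 0  # integrity: keys are unique
  - row_count > 0
"""

scan = Scan()
scan.set_data_source_name("snowflake")                 # matches configuration.yml
scan.add_configuration_yaml_file("configuration.yml")  # warehouse connection details
scan.add_sodacl_yaml_str(checks)
scan.execute()
scan.assert_no_checks_fail()  # raise if any quality rule is violated
```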
So an engineer, or an analyst, or even a business user can find out what data we have in the organization, who owns it, what it means, when it was last refreshed, and if it can be trusted, as well as where it’s being used and how it’s being sourced. Atlan provides that holistic picture of that journey.
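As a concrete example of that kind of discovery, here’s a minimal sketch using Atlan’s pyatlan SDK to search for a table and surface ownership and freshness metadata. It assumes pyatlan’s FluentSearch interface; the tenant URL, API key, and table name are placeholders.

```python
from pyatlan.client.atlan import AtlanClient
from pyatlan.model.assets import Table
from pyatlan.model.fluent_search import CompoundQuery, FluentSearch

client = AtlanClient(
    base_url="https://your-tenant.atlan.com",  # placeholder tenant
    api_key="your-api-key",                    # placeholder credential
)

# Find active tables named POLICIES, then show who owns them
# and when the source system last updated them.
request = (
    FluentSearch()
    .where(CompoundQuery.active_assets())
    .where(CompoundQuery.asset_type(Table))
    .where(Table.NAME.eq("POLICIES"))
).to_request()

for result in client.asset.search(request):
    if isinstance(result, Table):
        print(result.qualified_name, result.owner_users, result.source_updated_at)
```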
In terms of analytical outputs, PowerBI is our current reporting platform. We also brought in Sigma for embedded and exploratory analytics use cases.
Why did you need an Active Metadata solution?
That’s the hardest sell: “Why do we need a catalog solution? Why do we need an Active Metadata solution?”
And the way I approach this problem is just due to the underlying need. Data is always going to grow 2x every two years; that’s been the industry trend since the 1970s.
So the problem that I see is that as data grows, there’s more metadata about that data, and that could be in the form of more database objects that you’re going to create, more files that you have to process, more sources that you ingest. Especially when you include more systems that you have to support, more BI tools that you have to enable, more of everything. Think about that: doubling the data. The metadata grows by a multiplicative factor on top of that.
One of the biggest struggles in any data team is answering questions from business users: “How do I find X, Y, Z data? Where do I get this? Where do I find this report?” And even if data teams do have that, they’ll ask, “Well, where’s it coming from? How do I get the underlying detail of that information?”
And when something goes wrong, which it inevitably will, “How do I troubleshoot that?” And my experience is that if there’s one little column on that report in PowerBI that is broken, a user will come and ask me, “Okay, what happened?”
And I don’t know, so I have to dig in. You open up the report, and it’s an archeological exercise to excavate from the report, to the pipelines, to the datasets, to the underlying source data to figure that out.
That’s always been a challenge. And that, in my opinion, is the true technical debt that weighs on every single data team out there: there’s never been a good way of handling that metadata. And it rears its ugly head, just like every tech debt does, in the form of the team spending the bulk of their time answering questions about the data, figuring out how people get access to data, and troubleshooting.
I’ve seen data teams spend upwards of 80% of their time in reactive mode, and if you average it out, usually a good 40% or 50% of their time is spent answering questions. That is a fundamental sink on developer productivity across the organization.
How do you get more velocity? That’s where Atlan comes into play. Maybe we can enable a business user to answer the question themselves, or a data analyst to answer it without involving engineering teams.
An engineering team can then focus on what they’re really supposed to do: acquire more data, enable more insights, and sit down with business users to collaborate on, “Hey, I have this idea, how do I enable this insight?” Rather than spending time answering the question of, “What went wrong here?” So that’s the way I see it, that’s the need, and selling that need can be difficult.
I brought in Atlan because it will help our team be better at handling data. Once we onboard Atlan, that’s the productivity I want to get to: teams spending less time answering questions, and more time collaborating on data.
We’re also using Atlan as a way of creating an authoritative set of datasets, so users know which data they can trust and use. We’re expanding our team to collaborate with other business groups so that they can self-serve their data analytics, and Atlan will be key to enabling that collaboration model between engineering and the business.
What made Atlan stand out in the market to you and your team?
Here’s the problem that I see in the marketplace. Every single catalog solution seems focused on just the catalog, or they focus on other product lines that are extensions of the catalog. In the case of traditional data catalogs like Alation, they focus on the fact that, “Hey, you can democratize data stewardship across the organization. Your whole organization could be stewarding data.” That was the genesis of it. So it’s the Wikipedia approach of data stewardship.
The reality is, hardly any team out there has a data steward. Maybe in a large organization you have a few of them, but that’s not a role that you want to hire for. What’s the value-add, what’s the ROI, for the data team or from a data governance perspective?
In the past, I worked at a large financial services firm, and we experienced all the challenges involved with a traditional catalog. We spent a ton of engineering hours integrating it with our existing systems, and then we needed an army of data stewards to build and maintain everything.
The reality with this approach is that you’re forcing data stewardship onto every part of the organization, and they just don’t have the bandwidth to do it. That’s why I saw a huge retreat from Alation, with people going back to Confluence pages, because it’s just easier to edit Confluence than to update a catalog.
So I knew there had to be a better approach to this problem, and that’s when I came across this article about “Data Catalog 3.0” by Prukalpa, and I was intrigued by this new approach. And I chose Atlan not just now, for Group 1001, but back in my previous role, too.
So one of the main reasons I chose Atlan is that Atlan is focused on a very strong mission. That’s the core of it. Yes, it’s Active Metadata Management, but the real kicker is that Atlan’s vision is data collaboration between engineering, analysts, and business teams.
Alation is not that. Their business model is to catalog the data in their system so that they can sell you Compose (their SQL editor). That’s the bread-and-butter moneymaker, from what I’ve seen. Their core product, the cataloging solution? They’ve never improved it; they focus on Compose instead. I didn’t like that from a product development perspective.
And with Atlan, I see their journey is really about enabling collaboration with data, whether it’s simplifying the work of onboarding the various data tools into Atlan from an engineering perspective, or, from an analyst perspective, being able to see the underlying datasets, see and leverage the lineage, and understand where a dataset has been, or integrating Slack to enable that communication about data across the organization.
So number one, what I focus on is the product vision and what their main mission is. And secondarily, on top of that, is just seeing the proof in the pudding: the developer velocity.
I know that in my previous organization we spent a ton of engineering hours integrating our existing systems with a traditional data catalog. With Atlan, I was able to get Group 1001 up and running in under two hours. Just the developer velocity of not having to spend all that time configuring and building integrations, because Atlan has out-of-the-box integrations for a lot of the core modern data stack? That’s huge.
We could focus more on the higher-value ask, and the higher-value ask is to enable better collaboration within the organization around data. That’s the real reason why I chose Atlan.
What do you intend on creating with Atlan? Do you have an idea of what use cases you’ll build, and the value you’ll drive?
The use case we have for Atlan right now is not the only use case we eventually want to build. The reason is that right now, we’re really focused on our core analytics stack, which involves Snowflake, Fivetran, Coalesce, Dagster, and the like. Sure, Atlan will solve that, but how do we extend Atlan across the enterprise? That means enabling cross-enterprise data governance, a holistic view of our enterprise’s data assets, and tracking PII and applying governance and policies related to it.
Any new business that we onboard can come with its own data stack. So one of the core components, from a data strategy perspective, is that we can leverage Atlan as a central governance framework: all organizations will publish data assets into Atlan so we have one holistic umbrella.
Another key use case is enabling self-service analytics across our organization. We plan to leverage Atlan to document our newly curated data so other departments can discover it, understand what a dataset is, how to use it, and whether they can trust the information. This will be key to facilitating collaboration with data and enabling our organization to be data-centric.
Photo by Benjamin Child on Unsplash