Scaling Data Collaboration, Governance, Quality, and Ownership Across 60 Data Teams
At a Glance
- Autodesk, a global leader in design and engineering software and services, created a modern data platform to better support their colleagues’ business intelligence needs
- Contending with a massive increase in data to ingest, and demand from consumers, Autodesk’s team began executing a data mesh strategy, allowing any team at Autodesk to build and own data products
- Using Atlan, 60 domain teams now have full visibility into the consumption of their data products, and Autodesk’s data consumers have a self-service interface to discover, understand, and trust these data products
“A data platform today needs to have a number of core features. It needs to be multi-domain, supporting data from many different parts of the business across many different subject areas. It needs to be multi-tenant, enabling multiple teams to work on the platform securely and in isolation, sharing only when they choose to. That leads to security: the platform has to protect data, especially our most sensitive customer data. It has to be compliant, meet privacy requirements, support discovery, and offer high-velocity, high-quality tooling for common extract, load, and transform operations.”
Mark Kidwell, Chief Data Architect, Data Platforms and Services
Founded in 1982, Autodesk effected seismic change for architects, engineers, and designers when it introduced computer-aided design. In the decades since, the company has grown into a leading, cloud-first technology company with $5 billion in annual revenue and nearly 14,000 employees, offering dozens of products and services and supporting diverse users from Media & Entertainment to Industrial Bioscience.
“A lot of folks may know Autodesk as the AutoCAD company, or might have used it in the past for design in architecture, engineering, or construction. It’s moved way beyond that. Those are our roots, but we now provide software, and empower innovators with all sorts of design technology, in addition to product design and manufacturing,” explained Mark Kidwell, Chief Data Architect, Data Platforms and Services at Autodesk.
Underpinning this transformation, from AutoCAD pioneer to Nasdaq 100 technology leader, is data-driven decision-making, powered by a visionary data team, and modern data technology like Atlan and Snowflake.
Joining Atlan at the 2023 Snowflake Summit, Mark shared with the Snowflake Community how their team overcame the challenge of scaling data collaboration and governance across 60 data teams with distinct ownership models, and used Atlan to help them build the data mesh that was right for them.
Autodesk’s Analytics Data Platform
While the Analytics Data Platform Group’s mission of enabling analytics is simply summarized, the team’s responsibilities are vast and complex. Their services include maintaining a number of core engines, data warehouses, data lakes and metastores. They provide ELT services, as well as ingestion, transformation, publishing, and orchestration tools to manage workloads, and analytics services like BI layers, dashboarding, and notebooks. And to coordinate these services, they drive a set of common tooling that enables data governance, discovery, security monitoring, and DataOps processes like pushing pipelines to production.
“We power both BI Analytics as well as a ton of ad-hoc analytics,” Mark shared. “We’re also used more for process reconciliation, an integration layer for a lot of data, and we can also power a single view of customer use cases. We’re enabling teams to push data to downstream systems after building data products on our platform. And finally, everyone’s favorite topic, AI and ML are a feature of the platform, as well.”
Autodesk’s Analytics Data Platform begins at the source, with typical enterprise systems like CRM, HR, and finance systems, and marketing automation. More unique to Autodesk is data related to their products and services, like subscriptions and licensing, product usage, or Platform APIs. Being a cloud-first business, most of these systems and sources are API- or event-based, requiring ingestion tools like Fivetran, Matillion, AWS Streaming, and Apache Spark.
“We use a combination of a data lake and a data warehouse. Our data warehouse is Snowflake, the data lake is AWS, and of course, all the technology sits on top of the lake and warehouse to run transformations, queries, and analytics,” Mark shared. “We’ve adopted a lot of the tools and technologies that are part of the Modern Data Stack, but we have a lot of use cases that require us to maintain the data lake for our high volume and high velocity data sets that generate events.”
Rounding out their modern data stack is a series of technologies they refer to as their access layer, like Looker, Power BI, notebooks, and AWS SageMaker, as well as reverse ETL tools to push data back into other systems.
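To make the lake-plus-warehouse pattern concrete, here is a minimal, hypothetical sketch of the kind of job such a stack supports: a Spark batch job reads high-volume product-usage events from an S3 data lake, aggregates them, and publishes a curated table to Snowflake via the Spark-Snowflake connector. The bucket, table names, and credentials are placeholders, not Autodesk’s actual pipeline.

```python
# Hypothetical sketch of a lake-to-warehouse publishing job (placeholder names,
# not Autodesk's actual pipeline): read raw events from S3, aggregate with Spark,
# and publish a curated table to Snowflake via the Spark-Snowflake connector.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("publish-product-usage").getOrCreate()

# Raw, high-volume events land in the data lake as partitioned Parquet files.
events = spark.read.parquet("s3://example-data-lake/raw/product_usage/")

# Transform: roll raw events up into a daily, per-product usage summary.
daily_usage = (
    events
    .withColumn("usage_date", F.to_date("event_timestamp"))
    .groupBy("usage_date", "product_id")
    .agg(
        F.countDistinct("user_id").alias("active_users"),
        F.count("*").alias("event_count"),
    )
)

# Publish the curated result to the warehouse (all credentials are placeholders).
sf_options = {
    "sfURL": "example_account.snowflakecomputing.com",
    "sfUser": "PIPELINE_USER",
    "sfPassword": "********",
    "sfDatabase": "ANALYTICS",
    "sfSchema": "PRODUCT_USAGE",
    "sfWarehouse": "TRANSFORM_WH",
}
(
    daily_usage.write
    .format("net.snowflake.spark.snowflake")  # Spark-Snowflake connector
    .options(**sf_options)
    .option("dbtable", "PRODUCT_USAGE_DAILY")
    .mode("overwrite")
    .save()
)
```

In a setup like this, the lake absorbs the raw, high-velocity event data, while Snowflake serves the curated, query-ready product to BI and analytics users.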
Choosing Snowflake to Supercharge Business Intelligence
In 2019, Autodesk’s Analytics Data Platform utilized only a data lake, making it difficult for their users to consume data, or to build reports and dashboards. Focusing on Business Intelligence use cases, Mark’s team first adopted Snowflake to power analytics, leaving existing ingestion processes the same.
Still experiencing issues upstream across ingestion, transformation, and workflows, Autodesk then moved to make these processes more reliable, replacing legacy, hand-coded ingestion with modern, off-the-shelf, no- and low-code tools like Fivetran and Matillion.
Having introduced Snowflake as their data warehouse to simplify reporting and dashboarding, and having modernized their ingestion process, Mark’s team began to see an opportunity to implement Data Mesh.
“If we could do this ourselves, why couldn’t other people do this on our platform? This was the start of our data mesh approach. Could we take the tech stack that we built, and let other people build using the same technologies we’d been using for ingestion, publishing, and consumption?”
Growing Demand for Data Drives a New Approach
Autodesk began evaluating the data mesh concept, defining a problem set, determining goals, and making certain they understood alternative approaches.
“This problem of demand for data products and how we scale that? We were facing this exact issue,” Mark explained. “There was no way we could ingest all the data that we had in our backlog, even after the introduction of all these new tools and technologies that greatly accelerated things. A central data team was not going to be able to ingest all the data sources that we needed.”
By the start of 2021, the amount of data in Autodesk’s backlog for ingestion was larger than what had been ingested in the history of the Analytics Data Platform team.
“The few datasets we’d already brought in, like Salesforce, or some of the other marketing automations, were just a drop in the bucket compared to the customer experience analytics datasets, the customer success datasets, or our cloud cost and consumption datasets. All these other data that people wanted to bring into the platform,” Mark explained.
Demand for data was growing exponentially, the data team’s ingestion backlog was larger than what the platform had ever ingested to that point, and the team itself was far too small to manage it alone. And despite the work already done in choosing and implementing Snowflake and a more modern data ecosystem, which increased the velocity and quality of data brought into the Analytics Data Platform, technology gaps still persisted, especially in supporting less technical teams.
“Where Data Mesh could help us was by enabling any team throughout Autodesk to act as a publisher, to ingest their own data, and to present it to consumers for that data domain. That became our next goal,” Mark summarized.
Bringing Data Mesh from Concept to Reality
Over the course of their previous work, the Analytics Data Platform team had already made progress against Zhamak Dehghani’s four core pillars of Data Mesh, but in order to further translate these concepts into a strategy that met their needs, the team began a gap analysis to see where they could improve. Moving pillar-by-pillar, Mark’s team began mapping potential improvements to their two key audiences: Producers and Consumers.
Decentralized Domain Ownership
The first pillar, Decentralized Domain Ownership and Architecture, ensures that the technology and teams responsible for creating and consuming data can scale as sources, use cases, and consumption of data increases.
“We had a long history of supporting data domains and different teams working on the platform, owning those domains. They were acting relatively independently, and perhaps too independently,” Mark shared. “A real challenge for us was finding data that these domain owners had brought into the system. And if you were a consumer with an analytics question, a common complaint was that they had no idea an asset was there, or how to find it.”
Data as a Product
The second pillar, Data as a Product, ensures data consumers can locate and understand data in a secure, compliant manner across multiple domains.
“A consistent definition of a data product meant defining what teams are expected to do in terms of defining product requirements, or what they are expected to do in terms of meeting data contracts and SLAs,” Mark explained. “We would have to move from teams that were simply ingesting data, and toward teams that were thoughtfully publishing data on the platform and thinking about what it meant to their consumers to have these data.”
Self-service Architecture
The third pillar, Self-service Architecture, ensures that the complexity of building and running interoperable data products is abstracted from domain teams, simplifying the creation and consumption of data.
“There are so many ways to define self-service. You could say we were self-service when we had Spark and people could write code,” Mark explained. “We were definitely better at self-service once we adopted no-code and low-code tools, but even if you used all those tools directly, there was no guarantee you would get the same results. Different teams might use them, and it results in a completely different data product. So we wanted to make sure that not only were we using self-service at the tool level, but we were providing frameworks or other reusable components.”
Federated Computational Governance
The fourth and final pillar, Federated Computational Governance, ensures the Data Mesh is interoperable and behaving as an ecosystem, maintaining high standards for quality and security, and that users can derive value from aggregated and correlated data products.
At the time, Autodesk was early in their data governance journey, making it difficult for the platform team to understand how their platform was used, for publishers to understand who consumed their products, and for consumers to get access to products.
“We couldn’t move forward with a lot of other things we wanted to do if we didn’t have a stronger governance footprint. This led to a series of workstreams for us, and a more crisp definition of who the different personas and roles using the platform were.”
Defining Workstreams to Support Publishers and Consumers
The Autodesk team began by formally defining the roles of publishers, consumers, and the platform team, then defined workstreams that improved discrete parts of the Analytics Data Platform, organized by the persona they would benefit. Top priority was given to workstreams that would benefit publishers, including platform-wide standards, and the processes and tools necessary to easily ingest and publish secure, compliant data.
Consumer workstreams focused on trust, ensuring that sensitive data could be shared on the platform, and that they had the tools they needed to discover and apply data. Finally, Data Platform workstreams ensured that Mark’s team could enforce quality standards, and understand data product consumption and its associated costs.
To this point, the Analytics Data Platform team was responsible for data engineering and defining product requirements, and knew the tools, data, and consumers for the data products that they built. But to drive trusted data at scale, each publishing team would need to learn these skills, as well.
“We don’t scale this by scaling up the core team. We had to enable other teams to do all these things,” Mark explained. “It meant that instead of [only] the core platform team knowing and using the tools to deliver products directly, we had to enable publisher teams to have their own data product owners and their own data engineers.”
Each of Autodesk’s publishing teams would need to define a Product Owner and Data Engineers. Product Owners would ensure that consumer requirements were understood, and Data Engineers would have the necessary expertise to use platform tools, and ensure high technical standards. Repeating the process across one publishing team after another, the Analytics Data Platform team would provide the tooling, standards, and enablement necessary for each publishing team to be successful.
Just two years later, Autodesk has successfully ingested dozens of data sources, and has built numerous data products, all delivered by either individual teams, or combinations of teams building composite data sources from multiple domains like Enterprise and Product Usage data.
“Since we started the self-service initiative, we’ve had a total of 45 use cases that have gone through since 2021. It’s not something that we could have done if we just had one core ingestion team; one core data product team.”
Mark Kidwell, Chief Data Architect, Data Platforms and Services
Bringing Data Mesh to Life with Atlan
With data publishers now building products, following the standards and rules of the platform team, utilizing modern tools, and performing quality checks, Autodesk’s focus moved to better enabling their growing base of data consumers.
These data consumers, like analysts and engineers, needed a simple way to discover data products. Alongside discovery, they often had similar needs, like understanding the business context of data products, their lineage, and how products are composed so they could ask pointed questions about their trustworthiness. If these questions weren’t easily answered, consumers would need to know the ownership of each data product.
“We needed something that could help bridge the gap between publishers and consumers, so we adopted a data catalog. Atlan is the layer that brings a lot of the metadata that publishers provide to the consumers, and it’s where consumers can discover and use the data they need,” Mark shared.
While Atlan would become Autodesk’s catalog of choice, and a long-needed bridge between consumers and publishers, the Analytics Data Platform team had three previous experiences with data catalog technology.
Autodesk’s first attempt was a home-grown data catalog, essentially a view of a Hive metastore with basic search functionality, limiting its usefulness to data teams, and its accessibility to data consumers.
“We had a number of false starts looking at data catalog technology. And (the technologies) we were looking at in 2020 just didn’t seem to work well enough to migrate off of what we were already doing,” Mark explained, referring to their search to replace their homegrown catalog.
Autodesk’s third attempt took the form of Amundsen, an open-source data discovery and metadata technology.
“When we got to our data mesh initiative in 2021, we decided to select Amundsen. It was a huge step up from our homegrown catalog. We could actually see data in Snowflake, and it had a decent search feature,” Mark shared. “Some of the drawbacks though, being open-source, were a lot of gaps in functionality. It turned out to be a lot of work adding basic features that we needed like the ability to update metadata by a data owner, and we had to build our own UI to do that, or to add things like lineage. If we wanted to do that with Amundsen, it was an investment.”
In 2022, seeking a data catalog to better support data mesh, Autodesk selected Atlan, now available to 120 active users who benefit from an out-of-the-box integration with Snowflake and Autodesk’s data lake, as well as custom metadata related to data quality and ownership.
“Our future phases are to continue to build upon that. We’ll keep enabling further enrichments and additional data sources, and also getting data that’s published by Atlan back out, and feeding other systems,” Mark explained.
Among the most important reasons that Autodesk chose Atlan was out-of-the-box support for their data sources, along with the interaction features they had expected from their prior data catalogs.
“After going through this with an open-source catalog and seeing the issues, we didn’t want to fight this fight again, so we chose things that worked and integrated very cleanly with our data stack,” Mark shared. “We wanted something that was very accessible, something that had API access that we could enrich with our own metadata as well as getting data back out. We also wanted something with a much stronger user experience, so folks could come in and leverage the catalog almost as a data portal. It could be the primary starting point to find the data they need and immediately start using it.”
Buy-versus-build economics were another consideration, with open-source solutions requiring investments in software engineering, and significant delays rolling out functionality. And with a growing diversity of roles utilizing Autodesk’s data mesh, Atlan promised fit-for-purpose experiences for consumer, publisher, and platform teams, alike.
“Atlan can tell publishers the usage of the tables or data products that they build. Of course, it helps consumers find data and understand more about the data that’s trustworthy. And for the platform team, we have visibility into all of this: we can understand now what actually is being used in the platform, what’s popular, what’s not. All things that weren’t possible before.”
Mark Kidwell, Chief Data Architect, Data Platforms and Services
A Modern (meta)Data Stack
As Atlan was added to the technology supporting Autodesk’s growing data mesh, the team realized the potential of the metadata that their data platform, itself, was generating, and decided to capture that data, load it into Snowflake, and publish it as data products.
“A few of the key sources of data are tenants and ownership, and one of the key things for administrators is understanding who owns data sets. It’s also a core need for understanding approval workflows and cost attribution,” Mark shared.
Usage and consumption metadata also unlocks crucial use cases for the platform team, driving understanding of how resources like data assets and cloud infrastructure are used, and attributing that usage back to the tenants and teams that publish to, and consume from, the platform.
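As a hedged illustration of what usage-and-consumption metadata can look like in practice, the sketch below queries Snowflake’s ACCOUNT_USAGE share and rolls up recent query activity by role and warehouse, a common proxy for attributing consumption back to tenants and teams. The connection details are placeholders, and treating a Snowflake role as a stand-in for a team is an assumption, not Autodesk’s documented model.

```python
# Hedged sketch (assumed approach, not Autodesk's implementation): pull recent
# query activity from Snowflake's ACCOUNT_USAGE share and roll it up by role
# and warehouse as a simple proxy for per-team consumption and cost attribution.
import snowflake.connector

conn = snowflake.connector.connect(
    account="example_account",   # placeholder credentials
    user="METADATA_PIPELINE",
    password="********",
    warehouse="METADATA_WH",
    role="ACCOUNTADMIN",         # ACCOUNT_USAGE views require a privileged role
)

USAGE_BY_ROLE_SQL = """
SELECT
    ROLE_NAME                      AS consuming_role,  -- stand-in for a team/tenant
    WAREHOUSE_NAME,
    DATE_TRUNC('day', START_TIME)  AS usage_day,
    COUNT(*)                       AS query_count,
    SUM(TOTAL_ELAPSED_TIME) / 1000 AS total_elapsed_seconds
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE START_TIME >= DATEADD('day', -30, CURRENT_TIMESTAMP())
GROUP BY 1, 2, 3
ORDER BY usage_day, total_elapsed_seconds DESC
"""

cur = conn.cursor()
try:
    cur.execute(USAGE_BY_ROLE_SQL)
    for role, warehouse, day, queries, seconds in cur.fetchall():
        print(f"{day} | {role} on {warehouse}: {queries} queries, {seconds:.0f}s")
finally:
    cur.close()
    conn.close()
```

In practice, results like these would be loaded back into Snowflake as their own data product rather than printed, but the query illustrates the attribution idea.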
Autodesk’s teams responsible for building data pipelines now use Atlan to understand process and query history, giving them a much richer view into the data platform for debugging and understanding how their pipelines are performing. And Autodesk’s data quality metrics, powered by the same pipelines and flows, are used to further enrich data assets in Atlan.
“When we take a lot of these metrics, or other data products, or the metadata that we build, we use those to enrich data assets in Atlan,” Mark explained. “Atlan, itself, now becomes a primary consumption layer for consumers and publishers that want to understand these important details around their processes and data assets.”
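As a rough sketch of what this kind of enrichment can look like, the example below pushes a pipeline’s data-quality result onto the corresponding table asset using Atlan’s Python SDK (pyatlan). The tenant URL, API key, qualified name, and quality values are placeholders, and the exact method names (updater, asset.save) should be verified against the pyatlan version in use; this is an assumed pattern, not Autodesk’s implementation.

```python
# Hedged sketch: push a pipeline's data-quality result onto the matching table
# asset in Atlan with the pyatlan SDK. Tenant URL, API key, qualified name, and
# quality values are placeholders; verify method names against your pyatlan version.
from pyatlan.client.atlan import AtlanClient
from pyatlan.model.assets import Table
from pyatlan.model.enums import CertificateStatus

client = AtlanClient(
    base_url="https://example.atlan.com",  # placeholder tenant
    api_key="****",                        # placeholder credential
)

# Hypothetical outputs of a quality check run by the publishing team's pipeline.
freshness_ok = True
row_count = 1_250_000

# Build a minimal "update" view of the existing asset (updater() in recent
# pyatlan releases; older versions use create_for_modification()).
table = Table.updater(
    qualified_name="default/snowflake/1234567890/ANALYTICS/PRODUCT_USAGE/PRODUCT_USAGE_DAILY",
    name="PRODUCT_USAGE_DAILY",
)
table.certificate_status = (
    CertificateStatus.VERIFIED if freshness_ok else CertificateStatus.DRAFT
)
table.user_description = (
    "Daily product-usage rollup. "
    f"Last quality run: freshness {'passed' if freshness_ok else 'failed'}, "
    f"{row_count:,} rows."
)

# Save the enrichment so consumers see it alongside the asset in Atlan.
client.asset.save(table)
```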
Lessons Learned
A Platform + Enablement Mindset
“Data Mesh isn’t necessarily an outcome. It’s not technology, and it’s not prescriptive. It’s a lot of ideas. They’re great ideas, and we had to do a lot of work to understand what those meant. And in the end, it helped us move toward a mindset of platform enablement.”
No “One Size Fits All”
“There are no silver bullets. Expect a lot of work making implicit or tribal knowledge explicit and documented. And what’s worked for us doesn’t work for others, necessarily. It’s important that folks adopting data mesh really consider their requirements. Some teams might not even need data mesh.”
Skills Gaps Will Exist
“Even as we’ve adopted this, there’s still a lot of gaps, both on centralized and decentralized teams. There’s a lot of different skills that are now distributed, and different teams have to pick those up. It’s an ongoing process and it just needs to be baked into the migration or transformation.”
Metadata Management Needs Data Teams
“All those additional metadata sources that we brought in? The source owner for a lot of those things happens to be the platform team, making it the team that’s responsible for ingesting. So the platform team is now responsible for both producing tools, and for using those tools. We face the same skills gaps, and we have the same issues getting these things to work, finding the right people, and building.”
Drink Your Own Champagne
“We use our own tooling to power our platform. We drink our own champagne. I like that, because we wanted to focus on the customer, and the customer is also us.”