When I joined Postman’s data team a little over a year ago, our data was largely a mystery to me. Every day, I’d post questions on Slack like “Where can I find our MAU (monthly active users)?” Someone would tell me where to get it, but as I dug further, I’d find MAU data in other locations. And sometimes the different locations contradicted each other.
Over time, I learned how to navigate Postman’s wealth of data—which tables had different versions of the same data, or different filters, or sync issues. But this didn’t stop with me. As the data team scaled nearly fivefold in one year, this issue came up again and again with each new team member.
At the time, Postman’s data system was fairly simple. We had a set of data tables, and information about those tables lived in the heads of our early data team members. This worked when the company and its data were small, but it couldn’t keep up as we started to grow exponentially.
Postman currently has hundreds of team members distributed across four continents, and more than 17 million users from 500,000 companies using our API platform.
From the start, Postman Co-founder and CTO Ankit Sobti wanted to make sure that data was democratized. He used to say that it’s difficult for a data team to sit and churn out insights day in and day out. Instead, he staunchly believes, everyone in the company should be able to access our data and gain insights from it. This became especially important in 2020, when Postman continued to scale while going fully remote during the COVID-19 pandemic.
To address this issue, the data team and I decided to take on Postman’s data system as a project last year. Our goal was to make Postman’s data easier to access and understand, both for new hires within the data team and for people across the company.
Modernizing and democratizing a large-scale data system is a big challenge, and we’re definitely not the only company trying to crack it. So, in the hopes that our experience may help others trying to deal with the same challenges, I now want to share how we went about this project, what worked and what didn’t, and what we’ve learned so far.
Where we started—the challenges of Postman’s data stack
At Postman, we’ve implemented a modern data stack. Data engineers bring data into Redshift, Amazon’s cloud data warehouse. Then our analysts transform the data with dbt, our SQL-based transformation tool, and create dashboards and Explores on Looker.
Currently, we have about 170 active users per week on Looker. That’s a lot for a company of around 400 people, but it’s not yet achieving our goal for everyone to be able to use our data.
One of the main issues we were facing was the lack of consistency when providing context around data—making context the missing layer in our data stack. As Postman grew, it became difficult for everyone to understand and, more importantly, trust our data.
We had been creating dashboards and visualizations based on requests from across the company: whatever people needed, we designed. However, the metrics on those dashboards often overlapped, so we had inadvertently created different versions of the same metrics. When it wasn’t clear how each metric was different, we lost people’s trust. (And as the saying goes: building trust is hard, but losing it is easy—it just takes one mistake.)
The data team’s Slack channel was filling up with questions from other teams asking us things like “Where is this data?” and “What data should I use?”
Our experienced data analysts spent hours each week fielding these questions from new hires. Of course, this was frustrating for all involved. But we also realized there was a larger problem—it would be a disaster if any of our analysts left the company, since so much information was stored in their heads.
In our data team’s sprint retrospectives, we realized that Postman’s data system needed help, so we embarked on a project to democratize our data and fix discoverability. Our goal was to create more time for our team and more trust within the company.
Solution #1: Documenting our data with Confluence
Instead of embarking on a massive overhaul of our data system, we decided to start smaller, implement a solution quickly, and see what we could learn from it. Postman was already using Atlassian, so we started by creating a Confluence document.
Before, all of our data questions and answers were stored in Slack. Slack can be hard to navigate and search, so people were asking the same questions over and over. It’s easy enough to answer one or two questions on Slack, but 20 or 100? It’s just not scalable.
Going forward, our goal was to make our new Confluence document a single, searchable source of truth.
Whenever something came up multiple times on Slack, we put it on Confluence. For example, when someone asked, “How do you calculate MAU?” we added the table and calculations to the document. When multiple people asked us for the same metrics, we also added those stats and charts.
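To make the MAU question concrete, here is a minimal sketch of the kind of calculation we documented: count the distinct users active in a trailing 30-day window. The event data and field names here are made up for illustration; the real calculation lives in our warehouse tables.

```python
from datetime import date, timedelta

# Hypothetical event log: (user_id, event_date) pairs, standing in for
# a real events table in the warehouse.
events = [
    ("u1", date(2021, 3, 2)),
    ("u1", date(2021, 3, 15)),
    ("u2", date(2021, 3, 20)),
    ("u3", date(2021, 2, 1)),  # falls outside the 30-day window below
]

def monthly_active_users(events, as_of):
    """Count distinct users with at least one event in the trailing 30 days."""
    window_start = as_of - timedelta(days=30)
    return len({user for user, day in events if window_start < day <= as_of})

print(monthly_active_users(events, date(2021, 3, 31)))  # 2
```

Writing the definition down like this, in one agreed-upon place, is exactly what kept different teams from quietly computing different versions of the same number.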
Solution #2: Creating a data dictionary with Google Sheets
Our Confluence document was a good start, but like Slack, a single document just couldn’t scale as quickly as we were. Our next idea was to create a data dictionary in Google Sheets.
This seemed fairly straightforward. We first got all our table, schema, and column names in one place. Then, for a few sprints, we assigned everyone in our data team to document five tables each. Each person set aside a couple of hours to write down everything they knew about their data tables in the Google Sheet.
We also included reviews in this process. After each person documented their tables, someone else in the data team would read through their work. If it seemed clear, they’d say it was good to go.
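The shape of each dictionary entry, and the review step, can be sketched as a simple record. The field names below are illustrative, not the exact columns of our Google Sheet.

```python
from dataclasses import dataclass

@dataclass
class DictionaryEntry:
    """One row of the data dictionary (field names are illustrative)."""
    schema: str
    table: str
    column: str
    description: str = ""
    owner: str = ""          # who set up and maintains the table
    reviewed: bool = False   # flipped after a second team member reads it

entry = DictionaryEntry(
    schema="analytics",
    table="daily_users",
    column="user_id",
    description="Unique identifier for a signed-up user.",
    owner="data-team",
)
entry.reviewed = True  # a teammate has read it and signed off
```

The `owner` and `reviewed` fields capture the two things our process depended on: someone accountable for the table, and a second pair of eyes on the documentation.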
It was a good idea, but we ran into challenges executing it:
- Low-quality documentation: At the time, our data team had nearly 20 people in it, but only three or four of them had been with Postman for more than a year. These veteran team members couldn’t document all of our data, so everyone chipped in. However, some of the people who were documenting our data didn’t actually know much about it. They hadn’t set up the data, and they weren’t the owner of the data table. Our newer team members would add whatever they understood, but it didn’t always give a complete picture of the data table.
- The new data dictionary also had trouble with scale: We had nearly 20 data team members trying to work on the document. With that many people writing, editing, and commenting at the same time, it quickly became too much to handle on a Google Sheet. And that was just the data team. We wanted to eventually open the data dictionary to the entire company, but we couldn’t figure out how to keep our documentation secure and tamper-proof with hundreds of users.
Solution #3: Implementing a pre-built data workspace with Atlan
After trying to build our own solution twice, we started to look for a pre-existing product that we could adopt. That’s when we found Atlan, a modern data workspace, which seemed like a clear solution for our data-discovery problems.
On Atlan, we’ve been able to catalog and document all of our data, and its catalog acts as a single source of truth for our data. The catalog includes multiple levels of permissions for different types of users within and outside of the data team, so everyone can search for and access data without having to message the data team.
The result? Everyone is able to find the right data for their use case, and the data is consistent for everyone accessing it. The clearest outcome is that everyone is finally talking about the same numbers, which is helping us rebuild trust in our data. If someone says that our growth is 5%, it’s 5%.
Our new data workspace has been a success for us because of its clean interface and powerful functionalities—providing documentation, ownership information and usage, and data discovery.
Moving beyond data discovery
At Postman, we needed to address issues around data discovery and context because they became major problems as we scaled. But as we’ve solved these problems, we’ve realized that we’ve inadvertently set ourselves up to take on bigger data challenges.
For example, now that we’ve set up a system to track our data, we can use it to understand our data lineage—including where each data asset comes from, and how they’re all connected to each other.
Data lineage can be really useful for a couple of reasons. First, knowing how our data is connected helps us solve our daily bugs and issues more quickly.
Second, we’ve found that having a single source of truth for our data helps our data team as we grow. Every time we change something or add something new, it’s important to check how it will affect everything else in our data system. Instead of posting a question on Slack, we can check data lineage and find everything we need to change or update.
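The "what will this change affect?" question is, at its core, a graph traversal: follow the lineage edges downstream from the asset you want to change. Here is a minimal sketch of that idea, with made-up asset names; in practice our workspace tooling does this for us.

```python
# A lineage graph as an adjacency list: each asset maps to the assets
# built directly from it. Asset names are made up for illustration.
lineage = {
    "raw.events": ["dbt.daily_users"],
    "dbt.daily_users": ["dbt.mau", "looker.usage_dashboard"],
    "dbt.mau": ["looker.growth_dashboard"],
    "looker.usage_dashboard": [],
    "looker.growth_dashboard": [],
}

def downstream(asset, graph):
    """Return every asset that depends, directly or transitively, on `asset`."""
    seen, stack = set(), [asset]
    while stack:
        for child in graph.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

# Changing the raw events table touches everything built from it.
print(sorted(downstream("raw.events", lineage)))
# ['dbt.daily_users', 'dbt.mau', 'looker.growth_dashboard', 'looker.usage_dashboard']
```

Before a change ships, the set returned here is the checklist of models and dashboards to verify, which is what used to require a Slack thread and an experienced analyst's memory.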
Lineage is just one avenue that we’ve seen open up as we’ve improved the way we document and catalog our data. We’re also taking steps toward maintaining more consistent data descriptions across tools, improving our data quality, and more.
In the long run, we see work on improving data discovery as a foundation for democratizing data across the company. Having a reliable data foundation, where people can find and understand all our data, opens the possibility of having everyone participate in analyzing data. This would enable our entire company to become more data-aware and data-driven, which is the goal for any major company today.
What we learned from the last year
As I think back on our efforts to improve our data stack, there are a few learnings that stick out—things that we managed to get right the first time around, and also things that I wish we had done differently:
- Start small and build to bigger solutions: When you’re taking on a big data challenge, it’s easy to jump to thinking about equally big solutions that take lots of resources or money. However, starting with smaller, quicker solutions helped us understand more about what works and doesn’t work for us. Then, when we ventured into a more comprehensive, paid solution, we knew exactly what features we were looking for.
- Pay attention to scale: Our first two solutions were solid ideas, but they failed to keep up with our team and data as we scaled. That’s why it’s important to think about how a data product will scale as your company and data grow. Will it keep up if your data grows by 100x in a year? Can it handle lots of users, all with different needs and access levels?
- One improvement unlocks others: Fixing one part of your data stack lays the foundation for bigger improvements. Data cataloging and documentation aren’t particularly “fun,” but they’re crucial for anything else you may want to do with your data. After all, you can’t take on ML, AI, and the other latest data buzzwords if you don’t understand your data.