Where we started vs. where we are now in the data world
At the beginning of this year, I made some bold predictions about the future of the modern data stack in 2022.
Instead of just kicking off 2023 with a new set of predictions — which, let’s be real, I’m still going to do — I wanted to pause and look back at the last year in data. What did we get right? What didn’t quite go as expected? What did we completely miss?
This time of year, as social media is flooded with lofty predictions, it’s easy to think that the people behind them are all-knowing experts. But really, we’re just people. People who have been buried neck-deep in the data world for years, yes, but still fallible.
That’s why this year, instead of just doing this exercise internally, I’m opening it up to the public.
Here are my reflections on six major trends from 2022 — what I got right and where I went completely wrong.
Trend 1: The data mesh
The verdict: Mostly true ✅ but progressing slower than expected ❌
TL;DR: We did see a lot of market consolidation around the “data mesh platform”, but implementation practices and the tooling stack are further behind the hype than we expected. Data mesh is still on my radar, though, and will remain a key trend for 2023.
Where we started
Here’s what I said at the beginning of this year:
In 2022, I think we’ll see a ton of platforms rebrand and offer their services as the ‘ultimate data mesh platform’. But the thing is, the data mesh isn’t a platform or a service that you can buy off the shelf. It’s a design concept built on some wonderful principles like distributed ownership, domain-based design, data discoverability, and data product shipping standards — all of which are worth trying to operationalize in your organization.
So here’s my advice: As data leaders, it is important to stick to the first principles at a conceptual level, rather than buy into the hype that you’ll inevitably see in the market soon.
I wouldn’t be surprised if some teams (especially smaller ones) can achieve the data mesh architecture through a fully centralized data platform built on Snowflake and dbt, whereas others will leverage the same principles to consolidate their ‘data mesh’ across complex multi-cloud environments.
(All snippets are from the Future of the Modern Data Stack in 2022 Report.)
Where we are now
My prediction that companies would brand themselves around the data mesh absolutely happened. We saw this with Starburst, Databricks, Oracle, Google Cloud, Dremio, Confluent, Denodo, Soda, lakeFS, and K2 View, among others.
There has also been progress in the data mesh’s shift from idea to reality. Zhamak Dehghani published a book with O’Reilly about the data mesh, and a growing number of real user stories are being shared in the Data Mesh Learning Community.
The result is two increasingly popular theories of how to implement the data mesh:
- Via team structures: Distributed domain-based data teams that are responsible for publishing data products, supported by a central data platforms team that provides tools for the distributed teams
- Via “data as a product”: Data teams that are responsible for creating data products — i.e. pushing data governance to the “left”, closer to data producers than to consumers.
While this progress is notable, it ultimately didn’t move the needle far enough, and the data mesh is about as vague as it was a year ago. Data people are still craving clarity and specificity. For example, at Starburst’s conference on the data mesh, the most common question in the chat was “How can we actually implement the data mesh?”
While I expected that we as a community would move closer to the “how do we implement the data mesh” discussion this year, we’re still about where we were last year. We’re still in the early phases as teams figure out what implementing the data mesh really means. Though more people have now bought into the concept, there’s still a real lack of operational guidance on how to actually run a data mesh.
This is only compounded by the fact that the mesh tooling stack is still immature. While there’s been a lot of rebranding, we still don’t have a best-in-class reference architecture for how a data mesh can be achieved.
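To make the “data as a product” idea above a little more concrete, here’s a minimal sketch of what a data product contract could look like in code. To be clear, this is illustrative only: it isn’t any vendor’s implementation, and the field names and validation rules are assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import timedelta

@dataclass
class DataProduct:
    """A hypothetical, minimal 'data product' contract owned by a domain team.

    In a data mesh, each domain team would publish products like this, while a
    central platform team provides the tooling to register, discover, and monitor them.
    """
    name: str                  # e.g. "orders.daily_revenue"
    domain: str                # owning domain, e.g. "commerce"
    owner: str                 # accountable team or person
    description: str           # human-readable purpose
    output_schema: dict        # column name -> type; the published interface
    freshness_sla: timedelta   # how stale the data is allowed to get
    tags: list = field(default_factory=list)

    def validate(self) -> list:
        """Return a list of contract violations (empty means the product is publishable)."""
        problems = []
        if not self.owner:
            problems.append("every data product needs an accountable owner")
        if not self.output_schema:
            problems.append("the output schema (the product's interface) is missing")
        if not self.description:
            problems.append("a data product without a description is not discoverable")
        return problems


# Example: a domain team declaring one product.
daily_revenue = DataProduct(
    name="orders.daily_revenue",
    domain="commerce",
    owner="commerce-analytics@example.com",
    description="Net revenue per day, excluding refunds and test orders.",
    output_schema={"order_date": "date", "net_revenue": "decimal(18,2)"},
    freshness_sla=timedelta(hours=6),
    tags=["finance", "certified"],
)

print(daily_revenue.validate())  # [] -> ready to publish
```

The specifics don’t matter; what matters is that ownership, the published interface, and the SLA are explicit and machine-readable, which is what makes distributed ownership workable in practice.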
Trend 2: The metrics layer
The verdict: Mostly true ✅ but slower than expected ❌
TL;DR: dbt Labs’ Semantic Layer launched as expected. This was a massive step forward for the metrics layer, but we’re still waiting to see its full impact on the way that data teams work with metrics. The metrics layer promises to remain a significant trend going into 2023.
Where we started
Here’s what I said at the beginning of this year:
I am extremely excited about the metrics layer finally becoming a thing. A few months ago, George Fraser from Fivetran had an unpopular opinion that all metrics stores will evolve into BI tools. While I don’t fully agree, I do believe that a metrics layer that isn’t tightly integrated with BI is unlikely to ever become commonplace.
However, existing BI tools aren’t really incentivized to integrate an external metrics layer into their tools… which makes this a chicken and egg problem. Standalone metrics layers will struggle to encourage BI tools to adopt their frameworks, and will be forced to build BI like Looker was forced to many years ago.
This is why I’m really excited about dbt announcing their foray into the metrics layer. dbt already has enough distribution to encourage at least the modern BI tools (e.g. Preset, Mode, Thoughtspot) to integrate deeply into the dbt metrics API, which may create competitive pressure for the larger BI players.
I also think that metrics layers are so deeply intertwined with the transformation process that intuitively this makes sense. My prediction is that we’ll see metrics become a first-class citizen in more transformation tools in 2022.
Where we are now
I put my money on dbt Labs, rather than BI tools, as the leader of the metrics layer — and that turned out to be right.
dbt Labs’ Semantic Layer launched (in public preview) as promised, along with integrations across the modern data stack from companies like Hex, Mode, Thoughtspot, and Atlan (us!). This was a huge step forward for the modern data stack, and it is definitely paving the way for metrics to become a first-class citizen.
What we didn’t get right was what came next. We thought that along with dbt’s Semantic Layer, the metrics layer would be rocket-launched into everyday data life. In reality, though, progress has been more measured, and the metrics layer has gained less traction than expected.
In part, this is because the foundational technology took longer than I expected to launch. After all, the Semantic Layer was just released in October at dbt Coalesce.
It’s also because changing the way that people write metrics is hard. Companies can’t just flip a switch and move to a metric/semantic layer overnight. The change management process is massive, and it’s more likely that the switch to the metrics layer will take years, rather than months.
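For anyone wondering what “metrics as a first-class citizen” actually buys you, here’s a deliberately simplified sketch. This is not dbt’s Semantic Layer spec or API (dbt metrics are defined in YAML inside a dbt project); the class and method names below are hypothetical, and the point is only to show the core idea: define a metric once, then compile it to SQL for whichever tool asks.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Metric:
    """A toy metric definition: the business logic lives here, not in each BI tool."""
    name: str
    table: str           # the governed model the metric is computed from
    expression: str      # SQL aggregation over that model
    time_column: str
    dimensions: List[str] = field(default_factory=list)

    def to_sql(self, grain: str = "day", dimensions: Optional[List[str]] = None) -> str:
        """Compile the single metric definition into SQL for one specific query."""
        dims = dimensions or []
        for d in dims:
            if d not in self.dimensions:
                raise ValueError(f"{d!r} is not an allowed dimension for {self.name!r}")
        group_by = [f"date_trunc('{grain}', {self.time_column}) AS {grain}"] + dims
        select = ", ".join(group_by + [f"{self.expression} AS {self.name}"])
        return (
            f"SELECT {select}\nFROM {self.table}\n"
            f"GROUP BY {', '.join(str(i + 1) for i in range(len(group_by)))}"
        )

revenue = Metric(
    name="revenue",
    table="analytics.fct_orders",
    expression="sum(amount)",
    time_column="ordered_at",
    dimensions=["customer_segment", "region"],
)

# Every downstream tool asks the layer for "revenue by month and region"
# instead of re-implementing the aggregation itself.
print(revenue.to_sql(grain="month", dimensions=["region"]))
```

When every BI tool, notebook, and reverse ETL job queries the same definition, “revenue” can’t quietly mean three different things in three different dashboards, which is the whole promise of the metrics layer.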
Trend 3: Reverse ETL and data activation
The verdict: Mostly true ✅ but also starting to head in a new direction ❌
TL;DR: As expected, this space is starting to consolidate with ETL and data ingestion. At the same time, however, reverse ETL is now attempting to rebrand itself and expand its category.
Where we started
Here’s what I said at the beginning of this year:
I’m pretty excited about everything that’s solving the ‘last mile’ problem in the modern data stack. We’re now talking more about how to use data in daily operations than how to warehouse it — that’s an incredible sign of how mature the fundamental building blocks of the data stack (warehousing, transformation, etc) have become!
What I’m not so sure about is whether reverse ETL should be its own space or just be combined with a data ingestion tool, given how similar the fundamental capabilities of piping data in and out are. Players like Hevo Data have already started offering both ingestion and reverse ETL services in the same product, and I believe that we might see more consolidation (or deeper go-to-market partnerships) in the space soon.
Where we are now
My big prediction was that we’d see more consolidation in this space, and that definitely happened as expected. Most notably, the data ingestion company Airbyte acquired Grouparoo, an open-source reverse ETL product.
Meanwhile, other companies cemented their foothold in reverse ETL with launches like Hevo Data’s Hevo Activate (which added reverse ETL to the company’s existing ETL capabilities) and Rudderstack’s Reverse ETL (a rebranded version of its earlier Warehouse Actions product line).
However, rather than trending toward consolidation, some of the main players in reverse ETL have focused on redefining and expanding their own category this year. The latest buzzword is “data activation”, a new take on the “customer data platform” (CDP) category, driven by companies like Hightouch and Rudderstack.
Here’s their broad argument — in a world where data is stored in a central data platform, why do we need standalone CDPs? Instead, we could just “activate” data from the warehouse to handle traditional CDP functions like sending personalized emails.
In short, they’ve shifted from talking about “pushing data” to actually driving customer use cases with data. These companies still talk about reverse ETL, but it’s now a feature within their larger data activation platform, rather than their main descriptor. (Notably, Census has resisted this trend, sticking with the reverse ETL category across its site.)
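If you strip away the branding, the mechanics behind both “reverse ETL” and “data activation” are the same: query an audience or segment out of the warehouse, then upsert it into an operational tool. Here’s a bare-bones sketch of that loop; SQLite stands in for the warehouse, the CRM client is a stub, and none of this reflects how Hightouch, Census, or Hevo actually implement their syncs.

```python
import sqlite3  # stand-in for a cloud warehouse connection
from typing import Dict, Iterable, List

def fetch_audience(conn) -> List[Dict]:
    """Pull the 'last mile' data out of the warehouse: customers at high churn risk."""
    rows = conn.execute(
        """
        SELECT email, first_name, lifetime_value
        FROM customers
        WHERE churn_risk > 0.8
        """
    ).fetchall()
    return [
        {"email": email, "first_name": first_name, "lifetime_value": ltv}
        for email, first_name, ltv in rows
    ]

def push_to_crm(records: Iterable[Dict]) -> None:
    """Hypothetical 'activation' step: upsert each record into a CRM or email tool.

    A real reverse ETL product would batch these calls, diff against the last sync,
    and handle rate limits and retries; this stub just shows the direction of data flow.
    """
    for record in records:
        # e.g. POST https://api.example-crm.com/contacts  (illustrative endpoint only)
        print(f"Upserting {record['email']} with LTV {record['lifetime_value']}")

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers (email TEXT, first_name TEXT, lifetime_value REAL, churn_risk REAL)"
    )
    conn.execute("INSERT INTO customers VALUES ('a@example.com', 'Ada', 1200.0, 0.9)")
    push_to_crm(fetch_audience(conn))
```

The hard parts that real products solve (incremental diffing, rate limits, retries, field mapping) all live inside that `push_to_crm` step, which is why the category exists at all.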
Trend 4: Third-generation data catalogs and active metadata
The verdict: Mostly true ✅
TL;DR: This category continued to explode with buy-in from analysts and companies alike. While there’s not one dominant winner yet, the space is starting to draw a clear line between traditional data catalogs and modern catalogs (e.g. active metadata platforms, data catalogs for DataOps, etc).
Where we started
Here’s what we said at the beginning of this year:
The data world will always be diverse, and that diversity of people and tools will always lead to chaos. I’m probably biased, given that I’ve dedicated my life to building a company in the metadata space. But I truly believe that the key to bringing order to the chaos that is the modern data stack lies in how we can use and leverage metadata to create the modern data experience.
Gartner summarized the future of this category in a single sentence: ‘The stand-alone metadata management platform will be refocused from augmented data catalogs to a metadata ‘anywhere’ orchestration platform.’
Where data catalogs in the 2.0 generation were passive and siloed, the 3.0 generation is built on the principle that context needs to be available wherever and whenever users need it. Instead of forcing users to go to a separate tool, third-gen catalogs will leverage metadata to improve existing tools like Looker, dbt, and Slack, finally making the dream of an intelligent data management system a reality.
While there’s been a ton of activity and funding in the space in 2021, I’m quite sure we’ll see the rise of a dominant and truly third-gen data catalog (aka an active metadata platform) in 2022.
Where we are now
Given that this is my space, I’m not surprised that this prediction was fairly accurate. What I was surprised by, though, was how this space outperformed even my wildest expectations.
Active metadata and third-gen catalogs blew up even faster than I anticipated. In a massive shift from last year, when only a few people were talking about it, tons of companies from across the data ecosystem are now competing to claim this category. (Take, for example, Hevo Data and Castor’s adoption of the “Data Catalog 3.0” language.) A few have the tech to back up their talk, but others don’t, much like the early days of the data mesh, when experts and newbies alike appeared equally knowledgeable in a space that was still being defined.
Part of what made the space explode this year is how analysts latched onto and amplified this idea of modern metadata and data catalogs.
After its new Market Guide for Active Metadata in 2021, Gartner seems to have gone all in on active metadata. At its conference this year, active metadata popped up as one of the key themes in Gartner’s keynotes, as well as in what seemed like half of the week’s talks across different topics and categories.
G2 released a new “Active Metadata Management” category in the middle of the year, marking a “new generation of metadata”. They even called this the “third phase of…data catalogs”, in keeping with this new “third-generation” language.
Similarly, Forrester scrapped its Wave report on “Machine Learning Data Catalogs” to make way for “Enterprise Data Catalogs for DataOps”, marking a major shift in their idea of what a successful data catalog should look like. As part of this, Forrester upended their Wave rankings, moving all of the previous Leaders to the bottom or middle tiers — a major sign that the market is starting to separate modern catalogs (e.g. active metadata platforms, data catalogs for DataOps, etc.) from traditional data catalogs.
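For readers who find “active metadata” abstract, here’s a tiny sketch of the pattern the third-gen pitch describes: metadata doesn’t sit in a catalog waiting to be browsed, it gets pushed into the tools where work already happens. The data structures and the notifier below are hypothetical stand-ins, not any product’s API.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class AssetMetadata:
    """A slice of metadata about one table, as a catalog might store it."""
    name: str
    owner: str
    certified: bool
    downstream_dashboards: List[str]

def announce_schema_change(asset: AssetMetadata, changed_columns: List[str],
                           notify: Callable[[str, str], None]) -> None:
    """'Active' metadata: instead of waiting for someone to open the catalog,
    push the relevant context to the people and tools affected by a change."""
    message = (
        f"Schema change on {asset.name} (columns: {', '.join(changed_columns)}).\n"
        f"Owner: {asset.owner} | Certified: {asset.certified}\n"
        f"Dashboards that may break: {', '.join(asset.downstream_dashboards) or 'none'}"
    )
    notify(asset.owner, message)

def print_notifier(recipient: str, message: str) -> None:
    # Stand-in for a chat or ticketing integration (e.g. a Slack webhook).
    print(f"-> {recipient}\n{message}\n")

orders = AssetMetadata(
    name="analytics.fct_orders",
    owner="#commerce-data",
    certified=True,
    downstream_dashboards=["Revenue overview", "Weekly exec report"],
)

announce_schema_change(orders, ["amount", "currency"], notify=print_notifier)
```

Swap the print function for a Slack webhook or a Jira integration and you have the basic shape of “context wherever users need it”.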
Trend 5: Data teams as product teams
The verdict: Didn’t come true ❌
TL;DR: As much as I wish this had come true, we made far less progress on this trend than I expected. Twelve months later, we’re pretty much where we started.
Where we started
Here’s what we said at the beginning of the year:
Of all the hyped trends in 2021, this is the one I’m most bullish on. I believe that in the next decade, data teams will emerge as one of the most important teams in the organization fabric, powering the modern, data-driven companies at the forefront of the economy.
However, the reality is that data teams today are stuck in a service trap, and only 27% of their data projects are successful. I believe the key to fixing this lies in the concept of the ‘data product’ mindset, where data teams focus on building reusable, reproducible assets for the rest of the team. This will mean investing in user research, scalability, data product shipping standards, documentation, and more.
Where we are now
Looking back on this one hurts. Of all my predictions, this one not coming true (yet? 🤞) makes me incredibly sad.
Despite the talk, we’re still so far from the reality of data teams operating as product teams. While data tech has matured a lot this year, we haven’t progressed much further on the human side of data than we had last year. There just hasn’t been much progress in how data teams fundamentally operate — their culture, processes, etc.
Trend 6: Data observability
The verdict: Mostly true ✅
TL;DR: As predicted, this space continued to expand and fragment this year. Where it will go next year, though, and whether it will merge with adjacent categories is still an open question.
Where we started
Here’s what we said at the beginning of this year:
I believe that in the past two years, data teams have realized that tooling to improve productivity is not a good-to-have but a must-have. After all, data professionals are one of the most sought-after hires you will ever make, so they shouldn’t be wasting their time on troubleshooting pipelines.
So will data observability be a key part of the modern data stack in the future? Absolutely. But will data observability continue to exist as its own category or will it be merged into a broader category (like active metadata or data reliability)? This is what I’m not so sure about.
Ideally, if you have all your metadata in one open platform, you should be able to leverage it for a variety of use cases (like data cataloging, observability, lineage and more). I wrote about that idea last year in my article on the metadata lake.
That being said, today, there’s a ton of innovation that these spaces need independently. My sense is that we’ll continue to see fragmentation in 2022 before we see consolidation in the years to come.
Where we are now
The big prediction was that this space would continue to grow, but in a fragmented rather than consolidated fashion — and that certainly happened.
Data observability held its own and continued to grow in 2022. The number of players in this space just kept increasing, with existing companies getting bigger, new companies becoming mainstream, and new tools launching every month.
For example, in company news, there were some major Series Ds (Monte Carlo with $135M, Unravel with $50M) and Series Bs (Edge Delta with $63M, and Manta with $35M) in this space.
As for tooling, Acceldata open-sourced its platform, Kensu launched a data observability solution, AWS introduced observability features into AWS Glue 4.0, and Entanglement spun out another company focused on observability.
And in the thought leadership arena, both Monte Carlo and Kensu published major books with O’Reilly about data observability.
To make things more complicated, many industry-adjacent or early-stage companies have also been expanding and cementing their role in this space. For example, after starting in the data quality space, Soda is now a major player in data observability. Similarly, Acceldata started in logs observability but now brands itself as “Data Observability for the Modern Data Stack”. Metaplane and Bigeye have also been growing in prominence since their launch and Series B, respectively, in 2021.
Like last year, I’m still not sure where data observability is heading — toward independence, or toward merging with data reliability, active metadata, or some other category. But at a high level, it seems to be moving closer to data quality, with a focus on ensuring high-quality data rather than on active metadata.
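Since I keep talking about data observability in the abstract, here’s roughly what the simplest version of it looks like in practice: freshness and volume checks against table-level metadata. This is a toy sketch with hard-coded thresholds; real observability tools infer thresholds from history and cover far more signals (schema drift, distributions, lineage-aware alerting).

```python
import datetime as dt
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TableStats:
    """Metadata a warehouse typically exposes without scanning the table itself."""
    name: str
    last_loaded_at: dt.datetime
    row_count: int

def check_table(stats: TableStats,
                max_staleness: dt.timedelta,
                expected_rows: int,
                tolerance: float = 0.5,
                now: Optional[dt.datetime] = None) -> List[str]:
    """Return a list of incidents for one table: freshness and volume anomalies."""
    now = now or dt.datetime.utcnow()
    incidents = []
    staleness = now - stats.last_loaded_at
    if staleness > max_staleness:
        incidents.append(f"{stats.name}: data is {staleness} old (SLA is {max_staleness})")
    if stats.row_count < expected_rows * (1 - tolerance):
        incidents.append(
            f"{stats.name}: only {stats.row_count} rows loaded, expected around {expected_rows}"
        )
    return incidents

stats = TableStats(
    name="analytics.fct_orders",
    last_loaded_at=dt.datetime.utcnow() - dt.timedelta(hours=9),
    row_count=120,
)

for incident in check_table(stats, max_staleness=dt.timedelta(hours=6), expected_rows=1000):
    print(incident)
```

That overlap with table-level “is my data correct and on time?” checks is exactly why the category keeps drifting toward data quality rather than active metadata.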
As we close out December 2022, it’s amazing to see how much the data world has changed.
It was just nine months ago in March that Data Council happened, where we debated the heck out of the data world. We put out all the hot takes on our tech, community, vibe, and future — because we could. We were in growth mode, looking for the next new thing and vying for a chunk of the seemingly infinite data pie.
Now we’re in a different world, one of recession and layoffs and budget cuts. We’re shifting from growth mode to efficiency mode.
Don’t get me wrong — we’re still in the golden age of data. Just a few weeks ago, Snowflake announced record revenue and 67% year-over-year growth.
But as data leaders, we’re facing new challenges in this golden age of data. As most companies start talking about efficiency, how can we use data to make our organizations more efficient? What can data teams do to become the most valuable resource in their organizations?
I’m still trying to puzzle out how this will affect the modern data stack, and I can’t wait to share my thoughts soon. But the one thing I’m sure about is that 2023 will be a year to remember in the data world.
Our 2023 Future of the Modern Data Stack Report is out! Read it here or download the PDF.
Ready for spicy takes and expert insights on these trends? We assembled a panel of superstars (Bob Muglia, Barr Moses, Benn Stancil, Douglas Laney, and Tristan Handy) for the first Great Data Debate of 2023. Watch the recording here.
This blog was originally published on Towards Data Science.
Header photo: Mike Kononov on Unsplash