Modern metadata solutions, data quality frameworks, infrastructure, job roles, and other big changes are on their way.
2020 upended the data world, just as it did every other realm. When COVID shut down businesses and sent employees to work from home, companies had to quickly adapt to the “new normal”.
Cloud became an absolute necessity as organizations moved to remote work. Data governance and security became a big priority, with everyone accessing data from different locations and systems. Smarter AI became appealing, now that models trained on historical data no longer held up. In short, organizations realized they needed to make changes fast. Data investments went up, and organizations sought to upgrade their systems and create the perfect data stack.
With 2020 in the rearview mirror, we’re now looking ahead to a new and hopefully better year. What will 2021 bring to the data world? How will data infrastructure evolve to keep up with all the latest innovations and changes?
This year, we’ll see several new data trends: the emergence of new data roles and data quality frameworks, the rise of the modern data stack and modern metadata solutions, and the convergence of data lakes and warehouses.
1. Data lakes and warehouses are converging
For the past decade, data architects have designed data operations around two key units:
- Data lakes: Cheap storage to store vast amounts of raw or even unstructured data. The data lake architecture is typically great for ad-hoc exploration and data science use cases.
- Data warehouses: Traditionally, data warehouses have been optimized for compute and processing speed. This is helpful for reporting and business intelligence, making warehouses the system of choice for analytics teams.
Today, many companies still use both systems — a data lake for all their data, plus specialized data warehouses for analytics and reporting use cases.
While we’re not there yet, we’re starting to see the two ecosystems converge as data lakes and warehouses both add more capabilities.
Data warehouses like Snowflake already separate costs for storage and compute, drastically reducing the costs associated with storing all your data on data warehouses. Taking this one step further, some data warehouse players have started adding support for semi-structured data.
On the other hand, data lake players like Databricks have started moving towards the concept of a “data lakehouse”, and they recently announced support for SQL analytics and ACID transactions.
Learn more:
- Data Lakehouses: An emerging system design that combines the data structures and management features from a data warehouse with the low-cost storage of a data lake.
- The Great Data Debate: A cool episode of the a16z podcast with thought-provoking notes about different technologies and architectures emerging in the data stack.
2. The “modern data stack” goes mainstream
Starting in 2020, the term “modern data stack” was everywhere you looked in the data world. It refers to a new, best-of-breed data architecture for dealing with massive amounts of data.
One of the key pillars of the modern data stack is a powerful cloud platform. Originally centered around cloud data warehouses, it’s also beginning to include cloud data lakes and associated data lake engines.
Today, the modern data stack refers to a suite of tools for every part of the data workflow:
- Data ingestion: e.g. Fivetran, Stitch, Hevodata
- Data warehousing: e.g. Snowflake, BigQuery
- Data lakes: e.g. Amazon S3
- Data lake processing: e.g. Presto, Dremio, Databricks, Starburst
- Data transformation: e.g. dbt, Matillion
- Metadata management: e.g. Atlan
- BI tools: e.g. Looker
Learn more:
- Emerging Architectures for Modern Data Infrastructure: A great, in-depth read about which technologies are winning in the modern data stack, based on interviews with 20+ practitioners.
- Modern Data Stack Conference 2020: Resources from Fivetran’s first modern data stack conference on the latest innovations, tools, and best practices.
- The Modern Data Stack Newsletter: A biweekly newsletter with blogs, guides, and podcasts on the modern data stack.
3. Metadata 3.0: metadata management is reborn
As the modern data stack matures, companies have undertaken ambitious projects to upgrade their data infrastructure and sort out basic data needs (e.g. ingesting data, wrapping up cloud migration projects, and setting up new BI tools). While these projects have unlocked a lot of potential, they’ve also created chaos.
Context questions like “What does this column name actually mean?” and “Why are the sales numbers on the dashboard wrong again?” kill the agility of teams that are otherwise moving at breakneck speed.
While these aren’t new questions, we’re on the cusp of new disruptive solutions. As modern data platforms are converging around five main players (AWS, Azure, Google Cloud Platform, Snowflake, and Databricks) and metadata itself is becoming big data, there’s significant potential for bringing intelligence and automation to the metadata space.
In the next 24 to 36 months, we’ll see the rise of one or more modern metadata management platforms built for the modern data stack, which solve for data discovery, data cataloging, data lineage, and observability.
Learn more:
- Data Catalog 3.0: My article on the past and future of metadata solutions, and why we’re about to make a huge leap forward in creating modern metadata for the modern data stack.
4. New roles emerge: Analytics Engineer and Data Platform Leader
2020 saw two roles become more mainstream than ever before.
1. Data Platform Leader
Organizations are increasingly realizing that there needs to be a central team responsible for developing data platforms that help the rest of the organization do their work better. And naturally, this team needs a leader.
In the past, this was handled by more traditional roles like Data Warehousing Specialists or Data Architects. Now it’s become common to have a data leader, who leads the data initiative across the organization. These people go by a range of titles, such as “Head of Data Platform” or “Director of Data Platforms”.
Data platform leaders typically oversee the modernization (or set-up from scratch, for startups) of a company’s data stack. This includes setting up a cloud data lake and warehouse, implementing a data governance framework, choosing a BI tool, and more.
This new role comes with an important new KPI: end user adoption. This refers to the leader’s ability to get people and teams within the organization to adopt data (and data platforms) in their daily workflows. This is a welcome change, as it aligns the incentives of those deciding what data products to invest in with those who ultimately use the products.
2. Analytics Engineer
Every analyst I’ve spoken to in the past decade has had one major frustration: depending on data engineers to productionize their work and set up data pipelines.
The rise of powerful SQL-based pipeline tools like dbt and Dataform has changed this for the better. These tools put the entire data transformation process in the hands of data analysts, effectively giving them superpowers.
The result is the rise of the term “Analytics Engineer”, describing former analysts who now own the entire data stack, from ingestion and transformation to delivering usable data sets to the rest of the business.
Learn more:
- What Is an Analytics Engineer? An article from Claire Carroll at dbt about the why and how behind the new analytics engineering role.
5. Data quality frameworks are on the rise
Data quality is a space that hasn’t seen much innovation in the last two decades. However, it’s recently made significant strides, and different aspects of data quality are being incorporated throughout the data stack.
Data quality profiling
Data profiling is the process of reviewing data to understand its content and structure, check its quality, and identify how it can be used in the future.
Profiling can happen several times throughout a data asset’s lifecycle, ranging from shallow to in-depth assessments. It includes calculating missing values, minimums and maximums, medians and modes, frequency distributions, and other key statistical indicators that help users understand the underlying data quality.
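To make this concrete, here’s a minimal profiling sketch in Python with pandas. The dataset and column names (“orders.csv”, “order_status”) are hypothetical placeholders, not references to any particular tool.

```python
# A minimal data profiling sketch with pandas. The file and column names
# ("orders.csv", "order_status") are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("orders.csv")

profile = pd.DataFrame({
    "missing_values": df.isna().sum(),        # nulls per column
    "min": df.min(numeric_only=True),         # minimum per numeric column
    "max": df.max(numeric_only=True),         # maximum per numeric column
    "median": df.median(numeric_only=True),   # median per numeric column
    "mode": df.mode().iloc[0],                # most frequent value per column
})

# Frequency distribution for a single categorical column
status_freq = df["order_status"].value_counts(normalize=True)

print(profile)
print(status_freq)
```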
While data quality profiling has typically been a stand-alone product in the data stack, companies are increasingly incorporating it as a capability within modern data catalogs, enabling end users to understand and trust their data.
Business-driven data quality rules
Data quality isn’t just about the statistical understanding of data. It’s also about whether the data is trustworthy, based on business context.
For example, your sales numbers typically shouldn’t increase by more than 10% per week. A 100% spike in sales should alert the right team member and stop the data pipeline run, rather than making its way to a dashboard the CEO uses!
This need for intelligent alerts has led organizations to bring business teams into the process of writing data quality checks.
There still isn’t a great way for data teams to collaborate with business counterparts on data quality checks, but I expect this space will see a lot of innovation in the years to come. In the future, we’ll see smarter solutions that auto-generate business-driven data quality rules based on trends in the data.
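As a rough illustration, here’s what a business-driven rule like the sales check above might look like in plain Python. The threshold, metric names, and alerting step are all assumptions for the sketch; in practice the check would run inside your pipeline or orchestration tool.

```python
# A rough sketch of a business-driven data quality rule in plain Python.
# The threshold, metric names, and alerting step are illustrative assumptions.

def check_weekly_sales_growth(current_week: float,
                              previous_week: float,
                              max_growth: float = 0.10) -> None:
    """Fail the pipeline run if week-over-week sales growth exceeds the threshold."""
    growth = (current_week - previous_week) / previous_week
    if growth > max_growth:
        # A real implementation would alert the owning team here
        # (e.g. via Slack or PagerDuty) before stopping the run.
        raise ValueError(
            f"Sales grew {growth:.0%} week-over-week, above the "
            f"{max_growth:.0%} threshold; halting the pipeline."
        )

# Example: a 100% spike trips the rule and stops the run.
check_weekly_sales_growth(current_week=400_000, previous_week=200_000)
```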
Data quality tests in data pipelines
The third way that data quality is becoming mainstream is by writing quality checks into the data pipeline itself. This borrows principles from “unit tests” in the software engineering world.
Software engineering has included unit testing frameworks for years. These automatically test each individual unit of code to make sure it’s ready to use. Data quality tests within the pipeline mimic unit testing frameworks to bring the same confidence and speed to data engineering.
This helps teams catch data quality issues caused by upstream data changes before they affect the organization’s workflows and reports.
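As an illustration, here’s a small pytest-style sketch of such tests running against a pipeline’s output. The output path and column names are assumptions; dedicated frameworks like Great Expectations (linked below) offer richer, declarative versions of the same idea.

```python
# A minimal, pytest-style sketch of data quality tests inside a pipeline.
# The output path and column names are hypothetical placeholders.
import pandas as pd

def load_transformed_orders() -> pd.DataFrame:
    # Stand-in for reading the transformation step's output table.
    return pd.read_parquet("warehouse/orders_transformed.parquet")

def test_order_id_is_unique():
    df = load_transformed_orders()
    assert df["order_id"].is_unique, "Duplicate order_id values found"

def test_amounts_are_non_negative():
    df = load_transformed_orders()
    assert (df["amount"] >= 0).all(), "Negative order amounts found"

def test_customer_id_has_no_nulls():
    df = load_transformed_orders()
    assert df["customer_id"].notna().all(), "Null customer_id values found"
```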
Learn more:
- Amazon Deequ: Built internally at Amazon, Deequ is a promising open-source framework for data quality profiling.
- Great Expectations: This is emerging as a popular open-source community for data quality testing within data pipelines.
- Netflix’s Presentation on Scaling Data Quality: This is an interesting read for any data leader getting started on their data quality journey.
Do you agree or disagree with these trends? Spotted something that we’ve missed? Drop a comment with your insights!
This article was originally published in Towards Data Science.